This repository was archived by the owner on Apr 2, 2024. It is now read-only.

High load and CPU usage after upgrading from promscale 0.3.0 to 0.10.0 #1221

@enidvrenozaj

Description

Hi all,
I am facing unusually high load and CPU usage after upgrading from promscale version 0.3.0 to version 0.10.0.
I am running this environment:

promscale_0.10.0_Linux_x86_64
postgresql_version: 12.10-1.pgdg100+1
timescaledb_version: 2.6.0~debian10
promscale_extension: 0.3.0
OS: Debian GNU/Linux 10
Server: 6 core CPU, 32GB RAM

Nothing seems to have changed in the metrics being pushed to promscale; the samples/sec rate is the same as before:

Mär 03 11:18:58 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:18:58.975Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3500 metrics-max-sent-ts=2022-03-03T10:18:58.626Z
Mär 03 11:18:59 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:18:59.975Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=4500 metrics-max-sent-ts=2022-03-03T10:18:59.803Z
Mär 03 11:19:00 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:19:00.976Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=2000 metrics-max-sent-ts=2022-03-03T10:19:00.153Z
Mär 03 11:19:01 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:19:01.982Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3000 metrics-max-sent-ts=2022-03-03T10:19:01.862Z
# select oid, extname, extowner, extnamespace, extrelocatable, extversion from pg_extension;
   oid    |    extname    | extowner | extnamespace | extrelocatable | extversion 
----------+---------------+----------+--------------+----------------+------------
    13398 | plpgsql       |       10 |           11 | f              | 1.0
    16936 | pg_prometheus |       10 |         2200 | t              | 0.2.2
    16385 | timescaledb   |       10 |         2200 | f              | 2.6.0
 18042951 | promscale     |       10 |        17009 | f              | 0.3.0
# select * from _prom_catalog.default;
          key           |  value   
------------------------+----------
 chunk_interval         | 08:00:00
 metric_compression     | true
 ha_lease_timeout       | 1m
 ha_lease_refresh       | 10s
 retention_period       | 1 year
 trace_retention_period | 30 days
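For context: if I understand the Promscale API correctly, these defaults are meant to be changed through the prom_api helper functions rather than by editing the catalog table directly. A sketch, assuming the setter names from the Promscale docs apply to this version:

-- assumed helper functions; verify they exist in this Promscale version
SELECT prom_api.set_default_chunk_interval(INTERVAL '8 hours');
SELECT prom_api.set_default_retention_period(INTERVAL '1 year');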
# SELECT pg_size_pretty( pg_database_size('timescaledb') );
 pg_size_pretty 
----------------
 96 GB

The issue only started after the weekly full backup failed to finish:
pg_dump -d ${DATABASE_NAME} -U $PG_USER -w -j 4 -F d -f ${DB_BACKUP_PATH}/${TODAY}

pg_dump ran for 19 hours before failing, compared to the 2-3 hours it normally took.
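
To get a sense of how much vacuum work piled up while the dump was running, dead-tuple counts and last autovacuum times can be checked with the standard statistics view (a rough sketch; the LIMIT is arbitrary):

-- tables with the most dead tuples and when autovacuum last touched them
SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;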

Now I am seeing a lot of these autovacuum worker processes, which run the whole time the promscale binary is running:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     

 4743 postgres  20   0 8948700   3,1g   3,0g R  80,5  10,0  33:55.03 postgres: 12/main: User-Defined Action [1001]                                                                                               
  688 postgres  20   0  276040 208132   4120 R  73,2   0,6   2608:37 postgres: 12/main: stats collector                                                                                                          
 5704 postgres  20   0 8941548   2,8g   2,8g R  68,3   9,0  31:43.64 postgres: 12/main: User-Defined Action [1000]                                                                                               
 8599 postgres  20   0    9,9g   9,4g   7,9g R  63,4  29,8   3:14.63 postgres: 12/main: postgres timescaledb 127.0.0.1(40138) SELECT                                                                        
31811 postgres  20   0 9457284   6,8g   6,2g R  49,6  21,5 484:19.82 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
31988 postgres  20   0 9275104   3,7g   3,3g S  42,3  11,9 423:45.72 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
 7017 postgres  20   0 9275104   5,6g   5,2g R  26,8  17,7 410:19.93 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
29373 postgres  20   0 9262816   6,8g   6,4g R  26,8  21,7 426:21.42 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
16236 postgres  20   0 9279712   5,8g   5,4g R  25,2  18,4 402:31.58 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
  844 postgres  20   0 9266912   3,4g   3,0g R  24,4  10,7 422:09.85 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
 9826 postgres  20   0 9275104   4,9g   4,5g R  24,4  15,6 410:25.32 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
 1709 postgres  20   0 9258720   3,4g   3,0g R  23,6  10,9 421:10.75 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
30416 postgres  20   0 9328352   6,9g   6,4g R  22,8  21,9 336:59.67 postgres: 12/main: autovacuum worker   timescaledb                                                                                     
31812 postgres  20   0 9365280   7,2g   6,7g R  22,0  22,9 484:05.87 postgres: 12/main: autovacuum worker   timescaledb 
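
To see which relations those workers are currently vacuuming and how far along they are, something like the following should work on PostgreSQL 12 (a sketch using the standard progress view; the join only pulls in the worker start time):

-- current autovacuum progress per worker
SELECT p.pid,
       p.relid::regclass AS relation,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       a.backend_start
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid;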

Note that I do not see this issue on another similar machine that runs promscale version 0.6.0. Also, running psql -c 'CALL prom_api.execute_maintenance();' used to take ~30 minutes; now it takes ~1 hour.
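
The "User-Defined Action" backends in the process list are TimescaleDB background jobs (Promscale's maintenance typically runs as such jobs when TimescaleDB is installed), so their runtimes before and after the upgrade can be compared with the TimescaleDB 2.x information views (a sketch; column availability may differ slightly between versions):

-- background jobs and how long their last runs took
SELECT j.job_id, j.proc_schema, j.proc_name, j.schedule_interval,
       s.last_run_status, s.last_run_duration, s.total_failures
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
ORDER BY s.last_run_duration DESC NULLS LAST;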

Labels: Performance, need-more-info
