High load and CPU usage after upgrading from promscale 0.3.0 to 0.10.0 #1221
Description
Hi all,
I am facing unusually high load and CPU usage after I upgraded promscale from version 0.3.0 to version 0.10.0.
I am running this environment:
promscale_0.10.0_Linux_x86_64
postgresql_version: 12.10-1.pgdg100+1
timescaledb_version: 2.6.0~debian10
promscale_extension: 0.3.0
OS: Debian GNU/Linux 10
Server: 6 core CPU, 32GB RAM
Nothing seems to have changed in the metrics being pushed to promscale; the samples/sec rate is the same as before:
Mär 03 11:18:58 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:18:58.975Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3500 metrics-max-sent-ts=2022-03-03T10:18:58.626Z
Mär 03 11:18:59 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:18:59.975Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=4500 metrics-max-sent-ts=2022-03-03T10:18:59.803Z
Mär 03 11:19:00 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:19:00.976Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=2000 metrics-max-sent-ts=2022-03-03T10:19:00.153Z
Mär 03 11:19:01 timescaledb promscale_0.10.0_Linux_x86_64[29503]: level=info ts=2022-03-03T10:19:01.982Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=3000 metrics-max-sent-ts=2022-03-03T10:19:01.862Z
# select oid, extname, extowner, extnamespace, extrelocatable, extversion from pg_extension;
oid | extname | extowner | extnamespace | extrelocatable | extversion
----------+---------------+----------+--------------+----------------+------------
13398 | plpgsql | 10 | 11 | f | 1.0
16936 | pg_prometheus | 10 | 2200 | t | 0.2.2
16385 | timescaledb | 10 | 2200 | f | 2.6.0
18042951 | promscale | 10 | 17009 | f | 0.3.0
# select * from _prom_catalog.default;
key | value
------------------------+----------
chunk_interval | 08:00:00
metric_compression | true
ha_lease_timeout | 1m
ha_lease_refresh | 10s
retention_period | 1 year
trace_retention_period | 30 days
# SELECT pg_size_pretty( pg_database_size('timescaledb') );
pg_size_pretty
----------------
96 GB
The issue started to happen only after the weekly full backup failed to finish:
pg_dump -d ${DATABASE_NAME} -U $PG_USER -w -j 4 -F d -f ${DB_BACKUP_PATH}/${TODAY}
pg_dump ran for 19 hours before failing, compared with the 2-3 hours it normally took before.
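For what it's worth, a pg_dump session that stays open that long can keep vacuum from cleaning up dead rows created while it runs, so a query along these lines (just a sketch against the standard pg_stat_user_tables statistics view) should show whether dead tuples piled up:

-- Sketch: tables with the most dead tuples and their last autovacuum time
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;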
Now I am seeing a lot of these autovacuum worker processes, which run the whole time the promscale binary is running (a query sketch to inspect them follows the listing):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4743 postgres 20 0 8948700 3,1g 3,0g R 80,5 10,0 33:55.03 postgres: 12/main: User-Defined Action [1001]
688 postgres 20 0 276040 208132 4120 R 73,2 0,6 2608:37 postgres: 12/main: stats collector
5704 postgres 20 0 8941548 2,8g 2,8g R 68,3 9,0 31:43.64 postgres: 12/main: User-Defined Action [1000]
8599 postgres 20 0 9,9g 9,4g 7,9g R 63,4 29,8 3:14.63 postgres: 12/main: postgres timescaledb 127.0.0.1(40138) SELECT
31811 postgres 20 0 9457284 6,8g 6,2g R 49,6 21,5 484:19.82 postgres: 12/main: autovacuum worker timescaledb
31988 postgres 20 0 9275104 3,7g 3,3g S 42,3 11,9 423:45.72 postgres: 12/main: autovacuum worker timescaledb
7017 postgres 20 0 9275104 5,6g 5,2g R 26,8 17,7 410:19.93 postgres: 12/main: autovacuum worker timescaledb
29373 postgres 20 0 9262816 6,8g 6,4g R 26,8 21,7 426:21.42 postgres: 12/main: autovacuum worker timescaledb
16236 postgres 20 0 9279712 5,8g 5,4g R 25,2 18,4 402:31.58 postgres: 12/main: autovacuum worker timescaledb
844 postgres 20 0 9266912 3,4g 3,0g R 24,4 10,7 422:09.85 postgres: 12/main: autovacuum worker timescaledb
9826 postgres 20 0 9275104 4,9g 4,5g R 24,4 15,6 410:25.32 postgres: 12/main: autovacuum worker timescaledb
1709 postgres 20 0 9258720 3,4g 3,0g R 23,6 10,9 421:10.75 postgres: 12/main: autovacuum worker timescaledb
30416 postgres 20 0 9328352 6,9g 6,4g R 22,8 21,9 336:59.67 postgres: 12/main: autovacuum worker timescaledb
31812 postgres 20 0 9365280 7,2g 6,7g R 22,0 22,9 484:05.87 postgres: 12/main: autovacuum worker timescaledb
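To see what these workers are actually doing, a query like the following should work (a sketch using PostgreSQL's pg_stat_progress_vacuum and pg_stat_activity views, available in PostgreSQL 12):

-- Sketch: show which relation each autovacuum worker is processing and its progress
SELECT a.pid,
       p.relid::regclass AS relation,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       now() - a.xact_start AS running_for
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid)
ORDER BY running_for DESC;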
Note that I do not see this issue on another similar machine running promscale version 0.6.0. Also, running psql -c 'CALL prom_api.execute_maintenance();' used to take ~30 minutes and now takes ~1 hour.
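The "User-Defined Action" backends in the top output appear to be TimescaleDB background jobs; their runtimes can be checked with something like this (a sketch, assuming the timescaledb_information.jobs and timescaledb_information.job_stats views of TimescaleDB 2.x):

-- Sketch: registered background jobs and how long their last run took
SELECT j.job_id,
       j.proc_name,
       j.schedule_interval,
       s.last_run_duration,
       s.last_run_status,
       s.total_failures
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
ORDER BY s.last_run_duration DESC NULLS LAST;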