Fix interrupted CAgg refresh materialization phase #8607

fabriziomello · 2025-09-06T23:56:26Z

In #8514 we improved concurrent CAgg refreshes by splitting the second transaction (invalidation processing and data materialization) into two separated transactions. But now when interrupting the third transaction (data materialization) we'll left behind pending materialization ranges in the new metada table continuous_aggs_materialization_ranges.

Fixed it by properly checking the existance of pending materialization ranges and if it exists execute the materialization.

Disable-check: commit-count

Fixes #8591

codecov · 2025-09-07T22:16:33Z

Codecov Report

❌ Patch coverage is 86.20690% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.50%. Comparing base (0f7b4ca) to head (f8884b9).
⚠️ Report is 58 commits behind head on main.

Files with missing lines	Patch %	Lines
tsl/src/continuous_aggs/materialize.c	76.47%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8607      +/-   ##
==========================================
+ Coverage   82.41%   82.50%   +0.09%     
==========================================
  Files         248      248              
  Lines       47672    47671       -1     
  Branches    12116    12115       -1     
==========================================
+ Hits        39289    39332      +43     
- Misses       3493     3507      +14     
+ Partials     4890     4832      -58

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

In timescale#8514 we improved concurrent CAgg refreshes by splitting the second transaction (invalidation processing and data materialization) into two separated transactions. But now when interrupting the third transaction (data materialization) we'll left behind pending materialization ranges in the new metada table `continuous_aggs_materialization_ranges`. Fixed it by properly checking the existance of pending materialization ranges and if it exists execute the materialization.

tsl/src/continuous_aggs/refresh.c

kpan2034 · 2025-09-08T16:39:23Z

Can we also add a regression test which tests this? We can try to replicate this scenario by manually inserting data into the materialization ranges table.

fabriziomello · 2025-09-08T18:08:29Z

Can we also add a regression test which tests this? We can try to replicate this scenario by manually inserting data into the materialization ranges table.

Well I've added isolation tests for it, isn't it enough?

Check there's no pending ranges for refresh: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R695-R710
Refresh and wait before process the ranges: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R712-R714
Cancel the refresh terminating the session: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R715-R722
Check there are pending ranges to refresh: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R737-R739
Refresh again and don't emit the up-to-date message: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R751-R752
No more pending ranges to materialize: https://github.com/timescale/timescaledb/pull/8607/files#diff-ab434bc8b26f02a0749f1d52567d4f27cef16cd9a72a14919fa0a1093fd03e22R754-R769

tsl/src/continuous_aggs/refresh.c

kpan2034 · 2025-09-08T19:37:45Z

@fabriziomello I saw the isolation tests, but I was thinking of these test cases as well:

There are ranges in the invalidation log and pending ranges in the materialization ranges table which overlap
There are ranges in the invalidation log and pending ranges in the materialization ranges table which do not overlap

tsl/src/continuous_aggs/materialize.c

fabriziomello · 2025-09-08T19:47:05Z

@fabriziomello I saw the isolation tests, but I was thinking of these test cases as well:

There are ranges in the invalidation log and pending ranges in the materialization ranges table which overlap

There are ranges in the invalidation log and pending ranges in the materialization ranges table which do not overlap

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

The item 1 is already covered by the isolation test. I can add an extra step in the isolation test to process another non-overlaping pending range.

kpan2034 · 2025-09-08T20:02:15Z

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

It's possible for an insert to occur into the hypertable between the previous refresh being terminated, and the next refresh starting. In this case, the subsequent refresh would have to process the new invalidation log as well as the pending ranges.

item 1 is already covered by the isolation test

I don't think the test added covers this case. Since the third transaction of the refresh is terminated, it has already processed the invalidation logs and only pending ranges would be processed by the next refresh.

fabriziomello · 2025-09-08T20:19:18Z

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

It's possible for an insert to occur into the hypertable between the previous refresh being terminated, and the next refresh starting. In this case, the subsequent refresh would have to process the new invalidation log as well as the pending ranges.

item 1 is already covered by the isolation test

I don't think the test added covers this case. Since the third transaction of the refresh is terminated, it has already processed the invalidation logs and only pending ranges would be processed by the next refresh.

Ahhhh I got your point... you want to make sure we're not processing materializations that are not related (non-overlap) to the refresh window and in the other way around.

Tested it manually and it worked, do you think it is mandatory for merging this PR to also add this new permutations?

kpan2034 · 2025-09-08T20:35:46Z

I think it's fine to add the tests with a follow-up PR since we want to include this in the 2.22.1 release.

PR timescale#8607 addressed the issue when a failed refresh left behind pending materializations (i.e., rows in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch searched for those pending materializations that had some overlap with the current refresh's refresh window. However that PR used the refresh window that was passed as reference to process_cagg_invalidations_and_refresh and as such, was modified by that function to match with the invalidated buckets. This PR changes it to use the original refresh window (from the policy) so that it is more likely to overlap with the pending materializations (and more deterministic, since it doesn't depend on the data being materialized)

PR #8607 addressed the issue when a failed refresh left behind pending materializations (i.e., rows in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch searched for those pending materializations that had some overlap with the current refresh's refresh window. However that PR used the refresh window that was passed as reference to process_cagg_invalidations_and_refresh and as such, was modified by that function to match with the invalidated buckets. This PR changes it to use the original refresh window (from the policy) so that it is more likely to overlap with the pending materializations (and more deterministic, since it doesn't depend on the data being materialized) Disable-check: force-changelog-file

PR #8607 addressed the issue when a failed refresh left behind pending materializations (i.e., rows in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch searched for those pending materializations that had some overlap with the current refresh's refresh window. However that PR used the refresh window that was passed as reference to process_cagg_invalidations_and_refresh and as such, was modified by that function to match with the invalidated buckets. This PR changes it to use the original refresh window (from the policy) so that it is more likely to overlap with the pending materializations (and more deterministic, since it doesn't depend on the data being materialized) Disable-check: force-changelog-file (cherry picked from commit 9a47b63)

@MKrkkl

## 2.22.1 (2025-09-30) This release contains performance improvements and bug fixes since the [2.22.0](https://github.com/timescale/timescaledb/releases/tag/2.20.0) release. We recommend that you upgrade at the next available opportunity. This release blocks the ability to leverage **concurrent refresh policies** in **hierarchical continous aggregates**, as potential deadlocks can occur. If you have [concurrent refresh policies](https://docs.tigerdata.com/use-timescale/latest/continuous-aggregates/refresh-policies/#add-concurrent-refresh-policies) in **hierarchical** continous aggregates, [please disable the jobs](https://docs.tigerdata.com/api/latest/jobs-automation/alter_job/#samples), as following: ``` SELECT alter_job("<job_id_of_concurrent_policy>", scheduled => false); ``` **Bugfixes** * [#7766](#7766) Load OSM extension in retention background worker to drop tiered chunks * [#8550](#8550) Error in gapfill with expressions over aggregates and groupby columns and out-of-order columns * [#8593](#8593) Error on change of invalidation method for continuous aggregate * [#8599](#8599) Fix attnum mismatch bug in chunk constraint checks * [#8607](#8607) Fix interrupted continous aggregate refresh materialization phase leaving behind pending materialization ranges * [#8638](#8638) `ALTER TABLE RESET` for `orderby` settings * [#8644](#8644) Fix migration script for sparse index configuration * [#8657](#8657) Fix `CREATE TABLE WITH` when using UUIDv7 partitioning * [#8659](#8659) Don't propagate `ALTER TABLE` commands to foreign data wrapper chunks * [#8693](#8693) Compressed index not chosen for `varchar` typed `segmentby` columns * [#8707](#8707) Block concurrent refresh policies for hierarchical continous aggregate due to potential deadlocks **Thanks** * @MKrkkl for reporting a bug in Gapfill queries with expressions over aggregates and groupby columns * @brandonpurcell-dev for creating a test case that showed a bug in `CREATE TABLE WITH` when using UUIDv7 partitioning * @snyrkill for reporting a bug when interrupting a continous aggregate refresh --------- Signed-off-by: Philip Krauss <[email protected]> Co-authored-by: timescale-automation <123763385+github-actions[bot]@users.noreply.github.com> Co-authored-by: philkra <[email protected]> Co-authored-by: Philip Krauss <[email protected]> Co-authored-by: Iain Cox <[email protected]>

@MKrkkl

## 2.22.1 (2025-09-30) This release contains performance improvements and bug fixes since the [2.22.0](https://github.com/timescale/timescaledb/releases/tag/2.20.0) release. We recommend that you upgrade at the next available opportunity. This release blocks the ability to leverage **concurrent refresh policies** in **hierarchical continous aggregates**, as potential deadlocks can occur. If you have [concurrent refresh policies](https://docs.tigerdata.com/use-timescale/latest/continuous-aggregates/refresh-policies/#add-concurrent-refresh-policies) in **hierarchical** continous aggregates, [please disable the jobs](https://docs.tigerdata.com/api/latest/jobs-automation/alter_job/#samples), as following: ``` SELECT alter_job("<job_id_of_concurrent_policy>", scheduled => false); ``` **Bugfixes** * [#7766](#7766) Load OSM extension in retention background worker to drop tiered chunks * [#8550](#8550) Error in gapfill with expressions over aggregates and groupby columns and out-of-order columns * [#8593](#8593) Error on change of invalidation method for continuous aggregate * [#8599](#8599) Fix attnum mismatch bug in chunk constraint checks * [#8607](#8607) Fix interrupted continous aggregate refresh materialization phase leaving behind pending materialization ranges * [#8638](#8638) `ALTER TABLE RESET` for `orderby` settings * [#8644](#8644) Fix migration script for sparse index configuration * [#8657](#8657) Fix `CREATE TABLE WITH` when using UUIDv7 partitioning * [#8659](#8659) Don't propagate `ALTER TABLE` commands to foreign data wrapper chunks * [#8693](#8693) Compressed index not chosen for `varchar` typed `segmentby` columns * [#8707](#8707) Block concurrent refresh policies for hierarchical continous aggregate due to potential deadlocks **Thanks** * @MKrkkl for reporting a bug in Gapfill queries with expressions over aggregates and groupby columns * @brandonpurcell-dev for creating a test case that showed a bug in `CREATE TABLE WITH` when using UUIDv7 partitioning * @snyrkill for reporting a bug when interrupting a continous aggregate refresh --------- Signed-off-by: Philip Krauss <[email protected]> Co-authored-by: timescale-automation <123763385+github-actions[bot]@users.noreply.github.com> Co-authored-by: philkra <[email protected]> Co-authored-by: Philip Krauss <[email protected]> Co-authored-by: Iain Cox <[email protected]>

fabriziomello self-assigned this Sep 6, 2025

fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 8bf1d9d to 859f785 Compare September 7, 2025 21:44

fabriziomello added Continuous Aggregate bug labels Sep 7, 2025

fabriziomello changed the title ~~Fix cancelled CAgg refresh materialization phase~~ Fix interrupted CAgg refresh materialization phase Sep 7, 2025

fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch 3 times, most recently from 3be3183 to 37781ab Compare September 7, 2025 21:59

fabriziomello marked this pull request as ready for review September 7, 2025 22:00

fabriziomello requested review from mkindahl, kpan2034 and pnthao September 7, 2025 22:00

fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 37781ab to 8325de1 Compare September 7, 2025 22:02

fabriziomello added this to the v2.22.1 milestone Sep 7, 2025

svenklemm approved these changes Sep 8, 2025

View reviewed changes

fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 8325de1 to 1d003f3 Compare September 8, 2025 14:38

kpan2034 reviewed Sep 8, 2025

View reviewed changes

tsl/src/continuous_aggs/refresh.c Show resolved Hide resolved

pnthao reviewed Sep 8, 2025

View reviewed changes

tsl/src/continuous_aggs/refresh.c Show resolved Hide resolved

pnthao approved these changes Sep 8, 2025

View reviewed changes

kpan2034 reviewed Sep 8, 2025

View reviewed changes

tsl/src/continuous_aggs/materialize.c Show resolved Hide resolved

Addressed @kpan2034 reviews

f8884b9

kpan2034 approved these changes Sep 8, 2025

View reviewed changes

fabriziomello merged commit a0a109d into timescale:main Sep 8, 2025
46 of 47 checks passed

timescale-automation mentioned this pull request Sep 8, 2025

Backport to 2.22.x: #8607: Fix interrupted CAgg refresh materialization phase #8614

Merged

timescale-automation added the backported-2.22.x label Sep 8, 2025

pnthao mentioned this pull request Sep 18, 2025

Improve pending materializations search #8647

Merged

timescale-automation mentioned this pull request Sep 19, 2025

Backport to 2.22.x: #8647: Improve pending materializations search #8652

Merged

philkra mentioned this pull request Sep 30, 2025

Release 2.22.1 #8717

Merged

timescale-automation added the released-2.22.1 Released in 2.22.1 label Sep 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix interrupted CAgg refresh materialization phase #8607

Fix interrupted CAgg refresh materialization phase #8607

Uh oh!

fabriziomello commented Sep 6, 2025 •

edited

Loading

Uh oh!

codecov bot commented Sep 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

fabriziomello commented Sep 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

Uh oh!

fabriziomello commented Sep 8, 2025

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

fabriziomello commented Sep 8, 2025

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

Uh oh!

Uh oh!

Fix interrupted CAgg refresh materialization phase #8607

Fix interrupted CAgg refresh materialization phase #8607

Uh oh!

Conversation

fabriziomello commented Sep 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

fabriziomello commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

Uh oh!

fabriziomello commented Sep 8, 2025

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

fabriziomello commented Sep 8, 2025

Uh oh!

kpan2034 commented Sep 8, 2025

Uh oh!

Uh oh!

Uh oh!

fabriziomello commented Sep 6, 2025 •

edited

Loading

codecov bot commented Sep 7, 2025 •

edited

Loading

fabriziomello commented Sep 8, 2025 •

edited

Loading