Fix interrupted CAgg refresh materialization phase #8607
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #8607      +/-   ##
==========================================
+ Coverage   82.41%   82.50%   +0.09%
==========================================
  Files         248      248
  Lines       47672    47671       -1
  Branches    12116    12115       -1
==========================================
+ Hits        39289    39332      +43
- Misses       3493     3507      +14
+ Partials     4890     4832      -58
In timescale#8514 we improved concurrent CAgg refreshes by splitting the second transaction (invalidation processing and data materialization) into two separate transactions. However, when the third transaction (data materialization) is interrupted, pending materialization ranges are left behind in the new metadata table `continuous_aggs_materialization_ranges`. Fixed by properly checking for the existence of pending materialization ranges and, if any exist, executing the materialization.
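The fix described above can be sketched in a few lines. This is a simplified Python model (all names are invented for illustration, not the actual C implementation): transaction 2 converts invalidation-log entries into pending materialization ranges, and transaction 3 materializes whatever is pending, regardless of whether this particular refresh produced those ranges. That way, ranges left behind by an interrupted earlier refresh are still picked up.

```python
# Simplified model of the fixed refresh flow. All names are hypothetical;
# integer tuples stand in for time ranges.
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int]

@dataclass
class CaggState:
    invalidation_log: List[Range] = field(default_factory=list)
    # stand-in for continuous_aggs_materialization_ranges
    pending_ranges: List[Range] = field(default_factory=list)
    materialized: List[Range] = field(default_factory=list)

def refresh(state: CaggState) -> None:
    # Transaction 2: move invalidations into pending materialization ranges.
    state.pending_ranges.extend(state.invalidation_log)
    state.invalidation_log.clear()
    # Transaction 3: materialize ALL pending ranges, even if this refresh
    # processed no invalidations. Before the fix, this step could be skipped,
    # leaving ranges from an interrupted earlier refresh behind.
    while state.pending_ranges:
        state.materialized.append(state.pending_ranges.pop(0))

# Simulate an interrupted refresh: transaction 2 committed, transaction 3 killed.
state = CaggState(invalidation_log=[(0, 10)])
state.pending_ranges.extend(state.invalidation_log)
state.invalidation_log.clear()
# ... refresh terminated here; (0, 10) is left pending ...

refresh(state)  # the next refresh picks up the leftover range
```

The key point the sketch shows: the materialization step keys off the pending-ranges table itself, not off work done earlier in the same refresh.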
Can we also add a regression test which tests this? We can try to replicate this scenario by manually inserting data into the materialization ranges table.
@fabriziomello I saw the isolation tests, but I was thinking of these test cases as well:
Just to clarify: at this point we don't have invalidation logs to process anymore, because they were already processed; now we have only pending materialization ranges. Item 1 is already covered by the isolation test. I can add an extra step in the isolation test to process another non-overlapping pending range.
It's possible for an insert to occur into the hypertable between the previous refresh being terminated, and the next refresh starting. In this case, the subsequent refresh would have to process the new invalidation log as well as the pending ranges.
I don't think the test added covers this case. Since the third transaction of the refresh is terminated, it has already processed the invalidation logs, and only pending ranges would be processed by the next refresh.
Ahhhh, I got your point... you want to make sure we're not processing materializations that don't overlap the refresh window, and vice versa. I tested it manually and it worked. Do you think it is mandatory for merging this PR to also add these new permutations?
I think it's fine to add the tests with a follow-up PR since we want to include this in the 2.22.1 release. |
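The scenario debated above — a refresh that must process both fresh invalidations and leftover pending ranges, while ignoring non-overlapping ones — comes down to a range-overlap check. A minimal sketch, with hypothetical helper names and half-open integer ranges standing in for time buckets:

```python
# Hypothetical helpers illustrating which ranges a refresh would pick up.
from typing import List, Tuple

Range = Tuple[int, int]

def overlaps(a: Range, b: Range) -> bool:
    """Half-open ranges [start, end) overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def ranges_to_process(window: Range,
                      invalidations: List[Range],
                      pending: List[Range]) -> List[Range]:
    """Union of new invalidations and leftover pending ranges inside the window."""
    return [r for r in invalidations + pending if overlaps(window, r)]

window = (0, 100)
invalidations = [(10, 20)]        # insert that happened after the killed refresh
pending = [(30, 40), (150, 160)]  # leftovers; the second lies outside the window
```

Here `ranges_to_process(window, invalidations, pending)` keeps `(10, 20)` and `(30, 40)` but skips `(150, 160)`, which is the "non-overlapping permutation" the reviewers want covered by the isolation test.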
PR timescale#8607 addressed an issue where a failed refresh left behind pending materializations (i.e., rows in `_timescaledb_catalog.continuous_aggs_materialization_ranges`). The patch searched for pending materializations that overlapped the current refresh window. However, that PR used the refresh window that was passed by reference to `process_cagg_invalidations_and_refresh` and was therefore modified by that function to match the invalidated buckets. This PR changes it to use the original refresh window (from the policy), so that it is more likely to overlap the pending materializations (and is more deterministic, since it doesn't depend on the data being materialized).
PR #8607 addressed an issue where a failed refresh left behind pending materializations (i.e., rows in `_timescaledb_catalog.continuous_aggs_materialization_ranges`). The patch searched for pending materializations that overlapped the current refresh window. However, that PR used the refresh window that was passed by reference to `process_cagg_invalidations_and_refresh` and was therefore modified by that function to match the invalidated buckets. This PR changes it to use the original refresh window (from the policy), so that it is more likely to overlap the pending materializations (and is more deterministic, since it doesn't depend on the data being materialized).

Disable-check: force-changelog-file

(cherry picked from commit 9a47b63)
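The pass-by-reference problem described in that follow-up commit can be illustrated with a small sketch (names and signatures are hypothetical, not the actual C code): a function that narrows the window it receives mutates the caller's value, so the caller must keep a copy of the original policy window for the pending-range lookup.

```python
# Hypothetical model of the bug: the refresh function narrows the window
# it receives (mutating the caller's list) to match the invalidated buckets.
from typing import List, Tuple

def process_cagg_invalidations_and_refresh(window: List[int],
                                           buckets: List[Tuple[int, int]]) -> None:
    # Narrow the window in place to the span of the invalidated buckets.
    window[0] = max(window[0], min(b[0] for b in buckets))
    window[1] = min(window[1], max(b[1] for b in buckets))

original = [0, 100]       # refresh window from the policy
window = list(original)   # copy handed to the refresh; gets narrowed
process_cagg_invalidations_and_refresh(window, [(40, 50)])
# `window` is now [40, 50]; `original` still spans [0, 100] and is the
# one that should be matched against pending materialization ranges.
```

Using `original` for the overlap check is deterministic (it depends only on the policy), whereas the narrowed `window` depends on which buckets happened to be invalidated.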
## 2.22.1 (2025-09-30)

This release contains performance improvements and bug fixes since the [2.22.0](https://github.com/timescale/timescaledb/releases/tag/2.22.0) release. We recommend that you upgrade at the next available opportunity.

This release blocks the ability to leverage **concurrent refresh policies** in **hierarchical continuous aggregates**, as potential deadlocks can occur. If you have [concurrent refresh policies](https://docs.tigerdata.com/use-timescale/latest/continuous-aggregates/refresh-policies/#add-concurrent-refresh-policies) in **hierarchical** continuous aggregates, [please disable the jobs](https://docs.tigerdata.com/api/latest/jobs-automation/alter_job/#samples) as follows:

```
SELECT alter_job(<job_id_of_concurrent_policy>, scheduled => false);
```

**Bugfixes**
* [#7766](#7766) Load OSM extension in retention background worker to drop tiered chunks
* [#8550](#8550) Error in gapfill with expressions over aggregates and groupby columns and out-of-order columns
* [#8593](#8593) Error on change of invalidation method for continuous aggregate
* [#8599](#8599) Fix attnum mismatch bug in chunk constraint checks
* [#8607](#8607) Fix interrupted continuous aggregate refresh materialization phase leaving behind pending materialization ranges
* [#8638](#8638) `ALTER TABLE RESET` for `orderby` settings
* [#8644](#8644) Fix migration script for sparse index configuration
* [#8657](#8657) Fix `CREATE TABLE WITH` when using UUIDv7 partitioning
* [#8659](#8659) Don't propagate `ALTER TABLE` commands to foreign data wrapper chunks
* [#8693](#8693) Compressed index not chosen for `varchar` typed `segmentby` columns
* [#8707](#8707) Block concurrent refresh policies for hierarchical continuous aggregates due to potential deadlocks

**Thanks**
* @MKrkkl for reporting a bug in gapfill queries with expressions over aggregates and groupby columns
* @brandonpurcell-dev for creating a test case that showed a bug in `CREATE TABLE WITH` when using UUIDv7 partitioning
* @snyrkill for reporting a bug when interrupting a continuous aggregate refresh

---------

Signed-off-by: Philip Krauss <[email protected]>
Co-authored-by: timescale-automation <123763385+github-actions[bot]@users.noreply.github.com>
Co-authored-by: philkra <[email protected]>
Co-authored-by: Philip Krauss <[email protected]>
Co-authored-by: Iain Cox <[email protected]>
In #8514 we improved concurrent CAgg refreshes by splitting the second transaction (invalidation processing and data materialization) into two separate transactions. However, when the third transaction (data materialization) is interrupted, pending materialization ranges are left behind in the new metadata table `continuous_aggs_materialization_ranges`. Fixed by properly checking for the existence of pending materialization ranges and, if any exist, executing the materialization.
Disable-check: commit-count
Fixes #8591