Skip to content

Conversation

fabriziomello
Copy link
Member

@fabriziomello fabriziomello commented Sep 6, 2025

In #8514 we improved concurrent CAgg refreshes by splitting the second transaction (invalidation processing and data materialization) into two separated transactions. But now when interrupting the third transaction (data materialization) we'll left behind pending materialization ranges in the new metada table continuous_aggs_materialization_ranges.

Fixed it by properly checking the existance of pending materialization ranges and if it exists execute the materialization.

Disable-check: commit-count

Fixes #8591

@fabriziomello fabriziomello self-assigned this Sep 6, 2025
@fabriziomello fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 8bf1d9d to 859f785 Compare September 7, 2025 21:44
@fabriziomello fabriziomello changed the title Fix cancelled CAgg refresh materialization phase Fix interrupted CAgg refresh materialization phase Sep 7, 2025
@fabriziomello fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch 3 times, most recently from 3be3183 to 37781ab Compare September 7, 2025 21:59
@fabriziomello fabriziomello marked this pull request as ready for review September 7, 2025 22:00
@fabriziomello fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 37781ab to 8325de1 Compare September 7, 2025 22:02
@fabriziomello fabriziomello added this to the v2.22.1 milestone Sep 7, 2025
Copy link

codecov bot commented Sep 7, 2025

Codecov Report

❌ Patch coverage is 86.20690% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.50%. Comparing base (0f7b4ca) to head (f8884b9).
⚠️ Report is 58 commits behind head on main.

Files with missing lines Patch % Lines
tsl/src/continuous_aggs/materialize.c 76.47% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8607      +/-   ##
==========================================
+ Coverage   82.41%   82.50%   +0.09%     
==========================================
  Files         248      248              
  Lines       47672    47671       -1     
  Branches    12116    12115       -1     
==========================================
+ Hits        39289    39332      +43     
- Misses       3493     3507      +14     
+ Partials     4890     4832      -58     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

In timescale#8514 we improved concurrent CAgg refreshes by splitting the second
transaction (invalidation processing and data materialization) into two
separated transactions. But now when interrupting the third transaction
(data materialization) we'll left behind pending materialization ranges
in the new metada table `continuous_aggs_materialization_ranges`.

Fixed it by properly checking the existance of pending materialization
ranges and if it exists execute the materialization.
@fabriziomello fabriziomello force-pushed the fix_cancelled_cagg_refresh_materialization branch from 8325de1 to 1d003f3 Compare September 8, 2025 14:38
@kpan2034
Copy link
Member

kpan2034 commented Sep 8, 2025

Can we also add a regression test which tests this? We can try to replicate this scenario by manually inserting data into the materialization ranges table.

@fabriziomello
Copy link
Member Author

fabriziomello commented Sep 8, 2025

@kpan2034
Copy link
Member

kpan2034 commented Sep 8, 2025

@fabriziomello I saw the isolation tests, but I was thinking of these test cases as well:

  1. There are ranges in the invalidation log and pending ranges in the materialization ranges table which overlap
  2. There are ranges in the invalidation log and pending ranges in the materialization ranges table which do not overlap

@fabriziomello
Copy link
Member Author

@fabriziomello I saw the isolation tests, but I was thinking of these test cases as well:

  1. There are ranges in the invalidation log and pending ranges in the materialization ranges table which overlap
  2. There are ranges in the invalidation log and pending ranges in the materialization ranges table which do not overlap

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

The item 1 is already covered by the isolation test. I can add an extra step in the isolation test to process another non-overlaping pending range.

@kpan2034
Copy link
Member

kpan2034 commented Sep 8, 2025

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

It's possible for an insert to occur into the hypertable between the previous refresh being terminated, and the next refresh starting. In this case, the subsequent refresh would have to process the new invalidation log as well as the pending ranges.

item 1 is already covered by the isolation test

I don't think the test added covers this case. Since the third transaction of the refresh is terminated, it has already processed the invalidation logs and only pending ranges would be processed by the next refresh.

@fabriziomello
Copy link
Member Author

Just to clarify at this point we don't have invalidation logs to process anymore cause it was processed already and now we have only pending materialization ranges.

It's possible for an insert to occur into the hypertable between the previous refresh being terminated, and the next refresh starting. In this case, the subsequent refresh would have to process the new invalidation log as well as the pending ranges.

item 1 is already covered by the isolation test

I don't think the test added covers this case. Since the third transaction of the refresh is terminated, it has already processed the invalidation logs and only pending ranges would be processed by the next refresh.

Ahhhh I got your point... you want to make sure we're not processing materializations that are not related (non-overlap) to the refresh window and in the other way around.

Tested it manually and it worked, do you think it is mandatory for merging this PR to also add this new permutations?

@kpan2034
Copy link
Member

kpan2034 commented Sep 8, 2025

I think it's fine to add the tests with a follow-up PR since we want to include this in the 2.22.1 release.

@fabriziomello fabriziomello merged commit a0a109d into timescale:main Sep 8, 2025
46 of 47 checks passed
pnthao added a commit to pnthao/timescaledb that referenced this pull request Sep 18, 2025
PR timescale#8607 addressed the issue
when a failed refresh left behind pending materializations (i.e., rows
in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch
searched for those pending materializations that had some overlap with the
current refresh's refresh window. However that PR used the refresh window
that was passed as reference to process_cagg_invalidations_and_refresh and
as such, was modified by that function to match with the invalidated buckets.
This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations (and
more deterministic, since it doesn't depend on the data being materialized)
pnthao added a commit to pnthao/timescaledb that referenced this pull request Sep 18, 2025
PR timescale#8607 addressed the issue
when a failed refresh left behind pending materializations (i.e., rows
in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch
searched for those pending materializations that had some overlap with the
current refresh's refresh window. However that PR used the refresh window
that was passed as reference to process_cagg_invalidations_and_refresh and
as such, was modified by that function to match with the invalidated buckets.

This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations (and
more deterministic, since it doesn't depend on the data being materialized)
pnthao added a commit to pnthao/timescaledb that referenced this pull request Sep 19, 2025
PR timescale#8607 addressed the issue
when a failed refresh left behind pending materializations (i.e., rows
in _timescaledb_catalog.continuous_aggs_materialization_ranges). The patch
searched for those pending materializations that had some overlap with the
current refresh's refresh window. However that PR used the refresh window
that was passed as reference to process_cagg_invalidations_and_refresh and
as such, was modified by that function to match with the invalidated buckets.

This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations (and
more deterministic, since it doesn't depend on the data being materialized)
pnthao added a commit that referenced this pull request Sep 19, 2025
PR #8607 addressed the
issue when a failed refresh left behind pending materializations (i.e.,
rows in _timescaledb_catalog.continuous_aggs_materialization_ranges).
The patch searched for those pending materializations that had some
overlap with the current refresh's refresh window. However that PR used
the refresh window that was passed as reference to
process_cagg_invalidations_and_refresh and as such, was modified by that
function to match with the invalidated buckets.

This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations
(and more deterministic, since it doesn't depend on the data being
materialized)

Disable-check: force-changelog-file
timescale-automation pushed a commit that referenced this pull request Sep 19, 2025
PR #8607 addressed the
issue when a failed refresh left behind pending materializations (i.e.,
rows in _timescaledb_catalog.continuous_aggs_materialization_ranges).
The patch searched for those pending materializations that had some
overlap with the current refresh's refresh window. However that PR used
the refresh window that was passed as reference to
process_cagg_invalidations_and_refresh and as such, was modified by that
function to match with the invalidated buckets.

This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations
(and more deterministic, since it doesn't depend on the data being
materialized)

Disable-check: force-changelog-file
(cherry picked from commit 9a47b63)
timescale-automation pushed a commit that referenced this pull request Sep 19, 2025
PR #8607 addressed the
issue when a failed refresh left behind pending materializations (i.e.,
rows in _timescaledb_catalog.continuous_aggs_materialization_ranges).
The patch searched for those pending materializations that had some
overlap with the current refresh's refresh window. However that PR used
the refresh window that was passed as reference to
process_cagg_invalidations_and_refresh and as such, was modified by that
function to match with the invalidated buckets.

This PR changes it to use the original refresh window (from the policy)
so that it is more likely to overlap with the pending materializations
(and more deterministic, since it doesn't depend on the data being
materialized)

Disable-check: force-changelog-file
(cherry picked from commit 9a47b63)
@philkra philkra mentioned this pull request Sep 30, 2025
philkra added a commit that referenced this pull request Sep 30, 2025
## 2.22.1 (2025-09-30)

This release contains performance improvements and bug fixes since the
[2.22.0](https://github.com/timescale/timescaledb/releases/tag/2.20.0)
release. We recommend that you upgrade at the next available
opportunity.

This release blocks the ability to leverage **concurrent refresh
policies** in **hierarchical continous aggregates**, as potential
deadlocks can occur. If you have [concurrent refresh
policies](https://docs.tigerdata.com/use-timescale/latest/continuous-aggregates/refresh-policies/#add-concurrent-refresh-policies)
in **hierarchical** continous aggregates, [please disable the
jobs](https://docs.tigerdata.com/api/latest/jobs-automation/alter_job/#samples),
as following:

```
SELECT alter_job("<job_id_of_concurrent_policy>", scheduled => false);
```

**Bugfixes**
* [#7766](#7766) Load OSM
extension in retention background worker to drop tiered chunks
* [#8550](#8550) Error in
gapfill with expressions over aggregates and groupby columns and
out-of-order columns
* [#8593](#8593) Error on
change of invalidation method for continuous aggregate
* [#8599](#8599) Fix attnum
mismatch bug in chunk constraint checks
* [#8607](#8607) Fix
interrupted continous aggregate refresh materialization phase leaving
behind pending materialization ranges
* [#8638](#8638) `ALTER
TABLE RESET` for `orderby` settings
* [#8644](#8644) Fix
migration script for sparse index configuration
* [#8657](#8657) Fix
`CREATE TABLE WITH` when using UUIDv7 partitioning
* [#8659](#8659) Don't
propagate `ALTER TABLE` commands to foreign data wrapper chunks
* [#8693](#8693) Compressed
index not chosen for `varchar` typed `segmentby` columns
* [#8707](#8707) Block
concurrent refresh policies for hierarchical continous aggregate due to
potential deadlocks

**Thanks**
* @MKrkkl for reporting a bug in Gapfill queries with expressions over
aggregates and groupby columns
* @brandonpurcell-dev for creating a test case that showed a bug in
`CREATE TABLE WITH` when using UUIDv7 partitioning
* @snyrkill for reporting a bug when interrupting a continous aggregate
refresh

---------

Signed-off-by: Philip Krauss <[email protected]>
Co-authored-by: timescale-automation <123763385+github-actions[bot]@users.noreply.github.com>
Co-authored-by: philkra <[email protected]>
Co-authored-by: Philip Krauss <[email protected]>
Co-authored-by: Iain Cox <[email protected]>
@timescale-automation timescale-automation added the released-2.22.1 Released in 2.22.1 label Sep 30, 2025
philkra added a commit that referenced this pull request Oct 1, 2025
## 2.22.1 (2025-09-30)

This release contains performance improvements and bug fixes since the
[2.22.0](https://github.com/timescale/timescaledb/releases/tag/2.20.0)
release. We recommend that you upgrade at the next available
opportunity.

This release blocks the ability to leverage **concurrent refresh
policies** in **hierarchical continous aggregates**, as potential
deadlocks can occur. If you have [concurrent refresh
policies](https://docs.tigerdata.com/use-timescale/latest/continuous-aggregates/refresh-policies/#add-concurrent-refresh-policies)
in **hierarchical** continous aggregates, [please disable the
jobs](https://docs.tigerdata.com/api/latest/jobs-automation/alter_job/#samples),
as following:

```
SELECT alter_job("<job_id_of_concurrent_policy>", scheduled => false);
```

**Bugfixes**
* [#7766](#7766) Load OSM
extension in retention background worker to drop tiered chunks
* [#8550](#8550) Error in
gapfill with expressions over aggregates and groupby columns and
out-of-order columns
* [#8593](#8593) Error on
change of invalidation method for continuous aggregate
* [#8599](#8599) Fix attnum
mismatch bug in chunk constraint checks
* [#8607](#8607) Fix
interrupted continous aggregate refresh materialization phase leaving
behind pending materialization ranges
* [#8638](#8638) `ALTER
TABLE RESET` for `orderby` settings
* [#8644](#8644) Fix
migration script for sparse index configuration
* [#8657](#8657) Fix
`CREATE TABLE WITH` when using UUIDv7 partitioning
* [#8659](#8659) Don't
propagate `ALTER TABLE` commands to foreign data wrapper chunks
* [#8693](#8693) Compressed
index not chosen for `varchar` typed `segmentby` columns
* [#8707](#8707) Block
concurrent refresh policies for hierarchical continous aggregate due to
potential deadlocks

**Thanks**
* @MKrkkl for reporting a bug in Gapfill queries with expressions over
aggregates and groupby columns
* @brandonpurcell-dev for creating a test case that showed a bug in
`CREATE TABLE WITH` when using UUIDv7 partitioning
* @snyrkill for reporting a bug when interrupting a continous aggregate
refresh

---------

Signed-off-by: Philip Krauss <[email protected]>
Co-authored-by: timescale-automation <123763385+github-actions[bot]@users.noreply.github.com>
Co-authored-by: philkra <[email protected]>
Co-authored-by: Philip Krauss <[email protected]>
Co-authored-by: Iain Cox <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Interrupting a CAgg refresh in 2.22 leaves the CAgg in a inconsistent state
5 participants