Skip to content

Conversation

philjb
Copy link
Contributor

@philjb philjb commented Aug 12, 2025

The array ascending cursors do not check all the required conditions before doing an optimization to copy a full 1000 points of data from the tsm read buffer to the cursors response buffer. This can result in points from tsm files appearing in cursor output multiple times.

The conditions the bug appears are rare as it requires cache data that is older (or overlapping) with tsm data. The data must be in the same shard and must match for the same key.

The array ascending cursors do not check all the required conditions
before doing an optimization to copy a full 1000 points of data from the
tsm read buffer to the cursors response buffer. This can result in
points from tsm files appearing in cursor output multiple times.

The conditions the bug appears are rare as it requires cache data that
is older (or overlapping) with tsm data. The data must be in the same
shard and must match for the same key.

* fixes #26690
if c.tsm.pos < len(tvals.Timestamps) {
if pos == 0 && len(c.res.Timestamps) >= len(tvals.Timestamps) {
if pos == 0 && c.tsm.pos == 0 && len(c.res.Timestamps) >= len(tvals.Timestamps) {
// optimization: all points can be served from TSM data because
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much of an optimization this is anymore. I would expect (maybe) that go doesn't reslice (create a new slice header) if copy(dst[0:], src[0:]) is used, but i haven't looked yet.

Copy link
Contributor Author

@philjb philjb Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy calls into runtime.memmove. No slice headers are made but setup is done to get the arguments to memmove to point to the correct place in each backing array. memmove itself handles overlap per the spec.

Looks like this optimization avoids some setup to memmove at the code of the extra conditionals. Probably worthwhile to remove the "optimization" to have less code to debug and maintain. A microbench mark might show some speed differences.

This commit adds a test that verifies the duplicate data bug fix. It
will fail without the previous commit. See the test for details; in
short it sets up the right cache and tsm data to expose the bug which
was occuring when the cache had older data than the tsm files and then
the cache was exhausted while there were still more tsm values to
include in more Next() calls.
@philjb philjb requested review from devanbenz and gwossum August 14, 2025 23:04
kc := fs.KeyCursor(context.Background(), []byte("measurement,field=value#!~#value"), 0, true)
defer kc.Close()

cursor := newFloatArrayAscendingCursor()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test only looks at float array cursors but all types are impacted. They are made from a template so i made the assumption testing one type also covers the other datatypes.

// and incorrectly copies TSM data including the points from the previous Next() call. Additionally, cache data
// needs to be older than most of the TSM data. There needs to be enough tsm data younger than the cache data to completely
// fill the remaining space in the response buffer.
func TestAscendingCursorDuplicateDataBug(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI helped me make this test more robust; but it also made it rather long with the setup and precondition verifications.

Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be cleaner with test helpers from

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend using modern test helpers, like require.NoError instead of Fatalf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

The assert package is used for post conditions and the require package
is used for pre-conditions in the test. The library provides clearer
intent of conditions and better formatting of information (message and
data) in the situation an assertion fails. It also makes it easier to
read by reducing ifs but some lines are longer.

This commit addresses PR requests.
@philjb philjb requested a review from davidby-influx August 20, 2025 19:00
Copy link
Contributor

@davidby-influx davidby-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@philjb philjb merged commit 9d1e79d into master-1.x Aug 20, 2025
9 checks passed
@philjb philjb deleted the pjb/26690/tsm-buffer-reuse-bug branch August 20, 2025 21:22
devanbenz pushed a commit that referenced this pull request Sep 29, 2025
* fix(tsm1): fix condition check for optimization of array cursor

The array ascending cursors do not check all the required conditions
before doing an optimization to copy a full 1000 points of data from the
tsm read buffer to the cursors response buffer. This can result in
points from tsm files appearing in cursor output multiple times.

The conditions the bug appears are rare as it requires cache data that
is older (or overlapping) with tsm data. The data must be in the same
shard and must match for the same key.

* fixes #26690

* chore(tsm1): add test that covers array cursor bug

This commit adds a test that verifies the duplicate data bug fix. It
will fail without the previous commit. See the test for details; in
short it sets up the right cache and tsm data to expose the bug which
was occuring when the cache had older data than the tsm files and then
the cache was exhausted while there were still more tsm values to
include in more Next() calls.

* chore(test): switch to using the testify library in the new test

The assert package is used for post conditions and the require package
is used for pre-conditions in the test. The library provides clearer
intent of conditions and better formatting of information (message and
data) in the situation an assertion fails. It also makes it easier to
read by reducing ifs but some lines are longer.

This commit addresses PR requests.

(cherry picked from commit 9d1e79d)
devanbenz added a commit that referenced this pull request Sep 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1.x area/flux Issues related to the Flux query engine kind/bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[1.x oss]: Ascending key cursors may return same tsm values multiple times if merging with cache values

2 participants