Description
A bug in code last touched in 2020 causes tsm values to be returned multiple times as the output of the ArrayAscendingCursors. These cursors merge cache and tsm values. Typically they return 1000 points, since that is the default tsm block size, but they may return fewer (never more, as the response buffer has a fixed size). The cursor `Next()` methods contain logic for merging cache and tsm values when their time ranges overlap, for copying tsm data when the cache is exhausted (or has no matching data), and for handling cache data when the tsm data is exhausted. The tsm copying code checks whether it can copy a full block of 1000 points from the tsm source buffer to the response buffer, to avoid the overhead of reslicing the source and destination arrays. The condition check for this optimization is missing a required precondition, namely `c.tsm.pos == 0`. Without that check, tsm values that have already been returned from the cursor may be returned again by a subsequent `Next()` call.
The affected code is in influxdb/tsdb/engine/tsm1/array_cursor.gen.go (line 113 at fe6c64b), inside `func (c *floatArrayAscendingCursor) Next() *tsdb.FloatArray`, where the fast-path check reads:

    if pos == 0 && len(c.res.Timestamps) >= len(tvals.Timestamps) {
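To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same merge-with-fast-path pattern. This is not the generated cursor code; the `mergeCursor` type, its field names, and the buffer sizes are hypothetical stand-ins, and only timestamps are merged. It shows how omitting a `tsm.pos == 0` style precondition on the whole-block copy re-emits values that were already returned by an earlier call.

```go
package main

import "fmt"

// Simplified stand-in for the cache/tsm merge performed by an ascending
// array cursor. Not the real generated code; names are hypothetical.
type mergeCursor struct {
	tsm, cache []int64 // pre-sorted timestamps from the tsm block and the cache
	tsmPos     int     // how many tsm values have already been returned
	cachePos   int     // how many cache values have already been returned
	res        []int64 // fixed-size response buffer reused on every call
}

// next merges cache and tsm timestamps in ascending order into the response
// buffer. When buggy is true, the whole-block fast path omits the
// "tsmPos == 0" precondition, mirroring the reported bug.
func (c *mergeCursor) next(buggy bool) []int64 {
	pos := 0
	for pos < len(c.res) && c.tsmPos < len(c.tsm) && c.cachePos < len(c.cache) {
		if c.cache[c.cachePos] <= c.tsm[c.tsmPos] {
			c.res[pos] = c.cache[c.cachePos]
			c.cachePos++
		} else {
			c.res[pos] = c.tsm[c.tsmPos]
			c.tsmPos++
		}
		pos++
	}
	// Cache exhausted but tsm data remains: consider the whole-block copy.
	if c.cachePos >= len(c.cache) && c.tsmPos < len(c.tsm) {
		fastPath := pos == 0 && len(c.res) >= len(c.tsm)
		if !buggy {
			// Correct version: only valid if no tsm values were consumed yet.
			fastPath = fastPath && c.tsmPos == 0
		}
		if fastPath {
			// Copies the entire block; when tsmPos > 0 this re-emits values
			// already returned on a previous call.
			pos = copy(c.res, c.tsm)
			c.tsmPos = len(c.tsm)
		} else {
			n := copy(c.res[pos:], c.tsm[c.tsmPos:])
			pos += n
			c.tsmPos += n
		}
	}
	return c.res[:pos]
}

func main() {
	for _, buggy := range []bool{false, true} {
		// Cache data is older than the tsm data, as described above.
		c := &mergeCursor{
			cache: []int64{1, 2},
			tsm:   []int64{3, 4, 5, 6},
			res:   make([]int64, 4), // small buffer forces a second next() call
		}
		total := 0
		for out := c.next(buggy); len(out) > 0; out = c.next(buggy) {
			fmt.Printf("buggy=%v batch=%v\n", buggy, out)
			total += len(out)
		}
		// Correct: total=6. Buggy: total=8 (3 and 4 are returned twice).
		fmt.Printf("buggy=%v total=%d\n\n", buggy, total)
	}
}
```

Running the sketch prints a total of 6 with the corrected condition and 8 with the buggy one, the same duplication pattern as the 1800-vs-2000 count in the reproduction steps below.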
Several factors make this bug rare to encounter:
- These cursors are only used by flux query processing, not by influxql. The flux query endpoint is off by default in 1.x OSS.
- The cursor must have matching cache and tsm data for the same time range and be within the same shard. Short shard durations decrease the likelihood of encountering this bug.
- The cache data must be predominantly older than the tsm data. This is the reverse of how most workloads write to the database, which is close to "current time"; most don't write older data after newer data.
- The default cache size is 25MB. This often means the data in the cache and the data in tsm files are separated by minutes of time as the cache fills, so slightly out-of-order writes are still unlikely to hit the bug.
- If your write requests don't include a timestamp, influxdb assigns one, and the resulting timestamps are always monotonically increasing. The bug requires newer data to have arrived before older data.
- Descending array cursors are not impacted (but ascending is the default sort order).
Steps to reproduce:
- influxdb 1.11.8+
- write 1000 points and shorten the cache cold snapshot duration so they are written to tsm (or wait for a snapshot). The time range of these points should fall within one shard.
- if you shortened the cold snapshot duration, restore it so the next batch stays in the cache.
- write 800 points with timestamps within the same shard time range, but older than the points already written to tsm.
- Using flux, count the number of points over the full time range of the shard. If the bug is encountered, the result will be 2000 points instead of 1800 points. A rough sketch of these steps follows below.
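The following Go program is a hedged sketch of the write side of these steps, not a verbatim reproduction script. It assumes a local 1.x instance on localhost:8086 with the flux endpoint enabled and a database named "repro"; the database name, measurement name, point spacing, and window sizes are illustrative choices, and the snapshot step (lowering and then restoring the cold duration, or waiting for the cache to snapshot) still has to happen between the two writes. The flux query it prints at the end uses the db/rp bucket naming convention ("repro/autogen") and can be run against the 1.x flux endpoint or in Chronograf.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Assumed local 1.x write endpoint and database; adjust for your setup.
const writeURL = "http://localhost:8086/write?db=repro&precision=s"

// writeLineProtocol posts a batch of line-protocol points to the 1.x write API.
func writeLineProtocol(lines []string) error {
	resp, err := http.Post(writeURL, "text/plain", strings.NewReader(strings.Join(lines, "\n")))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("write failed: %s", resp.Status)
	}
	return nil
}

// points builds n line-protocol points, one second apart, starting at start.
func points(measurement string, start time.Time, n int) []string {
	lines := make([]string, n)
	for i := 0; i < n; i++ {
		ts := start.Add(time.Duration(i) * time.Second).Unix()
		lines[i] = fmt.Sprintf("%s value=%d %d", measurement, i, ts)
	}
	return lines
}

func main() {
	base := time.Now().Add(-2 * time.Hour).Truncate(time.Second)

	// Step 1: write 1000 "newer" points; let these reach a tsm file
	// (shorten the cold snapshot duration or wait for a snapshot).
	if err := writeLineProtocol(points("m", base.Add(30*time.Minute), 1000)); err != nil {
		panic(err)
	}

	// Step 2 (after the snapshot): write 800 points with *older* timestamps
	// inside the same shard; these remain in the cache.
	if err := writeLineProtocol(points("m", base, 800)); err != nil {
		panic(err)
	}

	// Step 3: count the points over the whole range with flux; a result of
	// 2000 instead of 1800 indicates the duplicate-return bug.
	fmt.Println(`from(bucket: "repro/autogen")
  |> range(start: -3h)
  |> filter(fn: (r) => r._measurement == "m" and r._field == "value")
  |> count()`)
}
```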