Description
A bug in code last touched in 2020 causes tsm values to be returned multiple times as the output of the ArrayAscendingCursors. These cursors merge cache and tsm values. Typically they return 1000 points, since that is the default tsm block size, but they may return fewer (never more, as the response buffer has a fixed size). The cursor `Next()` methods contain logic for merging cache and tsm values when their time ranges overlap, for copying tsm data when the cache is exhausted (or has no matching data), and for handling cache data when the tsm data is exhausted. The tsm copying code checks whether it can copy a full block of 1000 points from the tsm source buffer to the response buffer, to avoid the overhead of reslicing the source and destination arrays. The condition check for this optimization is missing a required precondition, namely `c.tsm.pos == 0`. Without that check, tsm values that have already been returned from the cursor may be returned again by a subsequent `Next()` call.
The affected code is in influxdb/tsdb/engine/tsm1/array_cursor.gen.go (line 113 at fe6c64b), inside `func (c *floatArrayAscendingCursor) Next() *tsdb.FloatArray`, where the fast-path check reads:

    if pos == 0 && len(c.res.Timestamps) >= len(tvals.Timestamps) {
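To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same merge-with-fast-path pattern. This is not the generated cursor code; the `mergeCursor` type, its field names, and the buffer sizes are hypothetical stand-ins, and only timestamps are merged. It shows how omitting a `tsm.pos == 0` style precondition on the whole-block copy re-emits values that were already returned by an earlier call.

```go
package main

import "fmt"

// Simplified stand-in for the cache/tsm merge performed by an ascending
// array cursor. Not the real generated code; names are hypothetical.
type mergeCursor struct {
	tsm, cache []int64 // pre-sorted timestamps from the tsm block and the cache
	tsmPos     int     // how many tsm values have already been returned
	cachePos   int     // how many cache values have already been returned
	res        []int64 // fixed-size response buffer reused on every call
}

// next merges cache and tsm timestamps in ascending order into the response
// buffer. When buggy is true, the whole-block fast path omits the
// "tsmPos == 0" precondition, mirroring the reported bug.
func (c *mergeCursor) next(buggy bool) []int64 {
	pos := 0
	for pos < len(c.res) && c.tsmPos < len(c.tsm) && c.cachePos < len(c.cache) {
		if c.cache[c.cachePos] <= c.tsm[c.tsmPos] {
			c.res[pos] = c.cache[c.cachePos]
			c.cachePos++
		} else {
			c.res[pos] = c.tsm[c.tsmPos]
			c.tsmPos++
		}
		pos++
	}
	// Cache exhausted but tsm data remains: consider the whole-block copy.
	if c.cachePos >= len(c.cache) && c.tsmPos < len(c.tsm) {
		fastPath := pos == 0 && len(c.res) >= len(c.tsm)
		if !buggy {
			// Correct version: only valid if no tsm values were consumed yet.
			fastPath = fastPath && c.tsmPos == 0
		}
		if fastPath {
			// Copies the entire block; when tsmPos > 0 this re-emits values
			// already returned on a previous call.
			pos = copy(c.res, c.tsm)
			c.tsmPos = len(c.tsm)
		} else {
			n := copy(c.res[pos:], c.tsm[c.tsmPos:])
			pos += n
			c.tsmPos += n
		}
	}
	return c.res[:pos]
}

func main() {
	for _, buggy := range []bool{false, true} {
		// Cache data is older than the tsm data, as described above.
		c := &mergeCursor{
			cache: []int64{1, 2},
			tsm:   []int64{3, 4, 5, 6},
			res:   make([]int64, 4), // small buffer forces a second next() call
		}
		total := 0
		for out := c.next(buggy); len(out) > 0; out = c.next(buggy) {
			fmt.Printf("buggy=%v batch=%v\n", buggy, out)
			total += len(out)
		}
		// Correct: total=6. Buggy: total=8 (3 and 4 are returned twice).
		fmt.Printf("buggy=%v total=%d\n\n", buggy, total)
	}
}
```

Running the sketch prints a total of 6 with the corrected condition and 8 with the buggy one, the same duplication pattern as the 1800-vs-2000 count in the reproduction steps below.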
Several factors make this bug rare to encounter:
- These cursors are only used by flux query processing, not by influxql. The flux query endpoint is off by default in 1.x OSS.
- The cursor must have matching cache and tsm data for the same time range and be within the same shard. Short shard durations decrease the likelihood of encountering this bug.
- The cache data must be predominantly older than the tsm data. This is the reverse of how most workloads write to the database, which is close to "current time"; most don't write older data after newer data.
- The default cache size is 25MB. This often means the data in the cache and the data in tsm files are separated by minutes of time as the cache fills, so slightly out-of-order writes are still unlikely to hit the bug.
- If your write requests don't include a timestamp, influxdb assigns one, and the resulting timestamps are always monotonically increasing. The bug requires newer data to have arrived before older data.
- Descending array cursors are not impacted (but ascending is the default sort order).
Steps to reproduce:
- influxdb 1.11.8+
- write 1000 points and shorten the cache cold snapshot duration so they are written to tsm (or wait for a snapshot). The time range of these points should fall within one shard.
- if you shortened the cold snapshot duration, restore it so the next batch stays in the cache.
- write 800 points with timestamps within the same shard time range, but older than the points already written to tsm.
- Using flux, count the number of points over the full time range of the shard. If the bug is encountered, the result will be 2000 points instead of 1800 points. A rough sketch of these steps follows below.
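The following Go program is a hedged sketch of the write side of these steps, not a verbatim reproduction script. It assumes a local 1.x instance on localhost:8086 with the flux endpoint enabled and a database named "repro"; the database name, measurement name, point spacing, and window sizes are illustrative choices, and the snapshot step (lowering and then restoring the cold duration, or waiting for the cache to snapshot) still has to happen between the two writes. The flux query it prints at the end uses the db/rp bucket naming convention ("repro/autogen") and can be run against the 1.x flux endpoint or in Chronograf.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Assumed local 1.x write endpoint and database; adjust for your setup.
const writeURL = "http://localhost:8086/write?db=repro&precision=s"

// writeLineProtocol posts a batch of line-protocol points to the 1.x write API.
func writeLineProtocol(lines []string) error {
	resp, err := http.Post(writeURL, "text/plain", strings.NewReader(strings.Join(lines, "\n")))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("write failed: %s", resp.Status)
	}
	return nil
}

// points builds n line-protocol points, one second apart, starting at start.
func points(measurement string, start time.Time, n int) []string {
	lines := make([]string, n)
	for i := 0; i < n; i++ {
		ts := start.Add(time.Duration(i) * time.Second).Unix()
		lines[i] = fmt.Sprintf("%s value=%d %d", measurement, i, ts)
	}
	return lines
}

func main() {
	base := time.Now().Add(-2 * time.Hour).Truncate(time.Second)

	// Step 1: write 1000 "newer" points; let these reach a tsm file
	// (shorten the cold snapshot duration or wait for a snapshot).
	if err := writeLineProtocol(points("m", base.Add(30*time.Minute), 1000)); err != nil {
		panic(err)
	}

	// Step 2 (after the snapshot): write 800 points with *older* timestamps
	// inside the same shard; these remain in the cache.
	if err := writeLineProtocol(points("m", base, 800)); err != nil {
		panic(err)
	}

	// Step 3: count the points over the whole range with flux; a result of
	// 2000 instead of 1800 indicates the duplicate-return bug.
	fmt.Println(`from(bucket: "repro/autogen")
  |> range(start: -3h)
  |> filter(fn: (r) => r._measurement == "m" and r._field == "value")
  |> count()`)
}
```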