
Conversation

feikesteenbergen (Member) commented on Apr 9, 2025:

Performance tweaks for uddsketch rollup.

When profiling some of our queries (using perf or Instruments), it was
clear that there was a hot piece of code in merge_sketch; commit 5da4ab2
addresses this.

After addressing this, the HashEntry calls were quite high in the profile, and with commit f010516 we were able to reduce the runtime by roughly another half.

Dummy data set improvements

For a 600MB data set of 300k rows, grouped into 3100 buckets, fully
prewarmed in shared_buffers, and with work_mem high enough to prevent
on-disk sorts, these are some results:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 2713 kB | 215.248 ms | | |
| baseline (with rollup) | 8218825 kB | 5887.086 ms | 100.0 % | 100.0 % |
| tweak merge_sketch | 1286316 kB | 2157.163 ms | 15.7 % | 36.6 % |
| tweak entry() | 1209574 kB | 1143.912 ms | 14.7 % | 19.4 % |
| introduce entry_upsert() | 1285968 kB | 1032.128 ms | 15.6 % | 17.5 % |

Some production data set, 1.5 million records, buckets = 1984:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 26891 kB | 3206.245 ms | | |
| baseline (with rollup) | 62747020 kB | 66543.369 ms | 100.0 % | 100.0 % |
| tweak entry() | 23245638 kB | 15875.682 ms | 37.0 % | 23.9 % |
| introduce entry_upsert() | 23245642 kB | 13829.614 ms | 37.0 % | 20.8 % |

Same production data set, 1.5 million records, buckets = 186277:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 26731096 kB | 7499.256 ms | | |
| baseline (with rollup) | 73056871 kB | 68607.422 ms | 100.0 % | 100.0 % |
| tweak entry() | 52589861 kB | 33231.835 ms | 72.0 % | 48.4 % |
| introduce entry_upsert() | 52589399 kB | 30150.045 ms | 72.0 % | 43.9 % |

*: Added here to show the theoretical lower limit of the time/memory needed

Optimization: merge UDDSketch using iterators

The previous implementation would create a UDDSketch (with a backing
HashMap) for every possible merge, and then call compact_buckets on it
to ensure the number of compactions in the target and the source was
equal.

Profiling this, we found that in a rollup call over a lot of data,
compact_buckets was by far the main contributor to CPU time.

However, when merging another sketch into this one, we don't need to
call compact_buckets every time: we can consume the keys and counts
directly and apply compact_key to each incoming key.

This prevents a lot of heap allocations, as compact_buckets makes a full
copy of the backing HashMap and then rebuilds it.

For a particular workload, this reduced the execution time from 30 to 12
seconds.
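As a rough illustration of the idea, here is a minimal sketch. The names, struct layout, and especially the halving behavior of compact_key are assumptions for the example, not the actual UDDSketch internals:

```rust
use std::collections::HashMap;

struct Sketch {
    compactions: u32,           // how many times this sketch has been compacted
    buckets: HashMap<i64, u64>, // bucket key -> count of values in the bucket
}

impl Sketch {
    // Assumed semantics for this example: each compaction halves the key
    // space, so a key from a less-compacted sketch maps to `key >> delta`.
    fn compact_key(key: i64, times: u32) -> i64 {
        key >> times
    }

    // Merge `other` into `self` by consuming its keys and counts directly,
    // compacting each key on the fly, instead of first building a compacted
    // copy of `other` the way a compact_buckets-based merge would.
    // (A full implementation would first compact `self` if `other` had
    // seen more compactions.)
    fn merge(&mut self, other: &Sketch) {
        assert!(self.compactions >= other.compactions);
        let delta = self.compactions - other.compactions;
        for (&key, &count) in &other.buckets {
            *self.buckets.entry(Self::compact_key(key, delta)).or_insert(0) += count;
        }
    }
}

fn main() {
    let mut target = Sketch { compactions: 2, buckets: HashMap::from([(4, 10)]) };
    let source = Sketch { compactions: 1, buckets: HashMap::from([(8, 3), (9, 5)]) };
    target.merge(&source);
    assert_eq!(target.buckets[&4], 18); // 8 >> 1 == 9 >> 1 == 4
}
```

The point is that no intermediate sketch is allocated: the source's buckets are folded straight into the target's map.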

Change implementation for entry()

Profiling showed that this function is quite the hotspot. By changing
the implementation slightly, iterating directly over the values instead
of walking the tree via the linked list, we improve the throughput of
certain CPU-bound queries.

We've seen a reduction in time needed of > 50% for certain rollup
queries.
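Schematically, the difference looks like this (hypothetical types; the toolkit's actual data structure differs, but the cache-locality argument is the same):

```rust
struct Entry {
    key: i64,
    count: u64,
    next: Option<usize>, // index of the next entry in the linked list
}

// Before (schematic): chase `next` links through the backing storage.
// Every hop is a data-dependent load, so the CPU cannot prefetch ahead.
fn find_linked(entries: &[Entry], head: Option<usize>, key: i64) -> Option<usize> {
    let mut cur = head;
    while let Some(i) = cur {
        if entries[i].key == key {
            return Some(i);
        }
        cur = entries[i].next;
    }
    None
}

// After (schematic): scan the contiguous values directly; sequential
// access over one allocation is far friendlier to caches and prefetching.
fn find_direct(entries: &[Entry], key: i64) -> Option<usize> {
    entries.iter().position(|e| e.key == key)
}

fn main() {
    let entries = vec![
        Entry { key: 7, count: 1, next: Some(1) },
        Entry { key: 9, count: 2, next: None },
    ];
    assert_eq!(find_linked(&entries, Some(0), 9), find_direct(&entries, 9));
}
```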

feikesteenbergen requested a review from dbeck on April 9, 2025 11:06
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 1bc5009 to 33315d1 on April 9, 2025 12:45
feikesteenbergen marked this pull request as ready for review on April 9, 2025 13:24
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 35ccb8b to ab7c9ec on April 10, 2025 11:29
sgichohi left a comment:
Awesome!

feikesteenbergen (Member, Author) commented:

There are probably similar improvements possible in other parts of the code. However, I'd like to merge those kinds of fixes one by one myself.

feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 24ad8de to 1bdeaf2 on April 10, 2025 15:07
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 1bdeaf2 to 84e9ee8 on April 11, 2025 08:14
feikesteenbergen changed the title from "Feike/reduce compact bucket calls" to "Performance improvements for UDDSketch" on Apr 11, 2025
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 11 times, most recently from 66f9ce1 to 2a97a28 on April 14, 2025 09:30
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 6ebf6ec to f74585b on April 14, 2025 14:52

Introduce entry_upsert()

Due to the way entry() was called, and because the borrow checker cannot
let us hold two mutable references into a map, we were doing double
lookups into the backing HashMap almost every time this function was
called.

Looking at the code, however, the only callers of this function wanted
to either increment by 1 or by a given count.

Therefore, introduce a function that supports exactly that use case. It
doesn't run into the borrow-checker problem, as it doesn't have to
return a mutable reference: it does the work immediately.
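A minimal sketch of the pattern, using std's HashMap as a stand-in for the actual backing map (the names and layout here are assumptions):

```rust
use std::collections::HashMap;

struct Sketch {
    buckets: HashMap<i64, u64>,
}

impl Sketch {
    // Before (schematic): receiving a mutable reference from entry()
    // pushed callers into a pattern that looks the key up twice:
    //
    //   if let Some(count) = sketch.buckets.get_mut(&key) {
    //       *count += n;                   // first lookup
    //   } else {
    //       sketch.buckets.insert(key, n); // second lookup
    //   }
    //
    // After: do the increment inside the function. No mutable reference
    // escapes, so the borrow checker is satisfied and one lookup suffices.
    fn entry_upsert(&mut self, key: i64, count: u64) {
        *self.buckets.entry(key).or_insert(0) += count;
    }
}

fn main() {
    let mut s = Sketch { buckets: HashMap::new() };
    s.entry_upsert(42, 1); // increment by 1
    s.entry_upsert(42, 9); // increment by a count
    assert_eq!(s.buckets[&42], 10);
}
```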
We also got one detail slightly wrong previously: we used the number of
values to reserve heap memory, but we actually want the number of
buckets.
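As a hypothetical illustration (the function and variable names here are made up):

```rust
// Size the reservation by the number of buckets, not by the (potentially
// much larger) total number of values stored across those buckets.
fn reserve_for_buckets(keys: &mut Vec<i64>, num_buckets: usize) {
    keys.reserve(num_buckets); // previously (wrong): keys.reserve(num_values)
}

fn main() {
    let mut keys: Vec<i64> = Vec::new();
    reserve_for_buckets(&mut keys, 3_100); // e.g. 3100 buckets, not 300k values
    assert!(keys.capacity() >= 3_100);
}
```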
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 4 times, most recently from 1920efe to 2564a82 on April 15, 2025 10:18
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 2564a82 to 8d007c2 on April 15, 2025 10:22
feikesteenbergen merged commit df79383 into main on Apr 15, 2025
13 checks passed