Performance improvements for UDDSketch #853
Merged
Conversation
feikesteenbergen commented on Apr 9, 2025
sgichohi approved these changes on Apr 10, 2025
Awesome!
There are probably similar improvements possible in other parts of the code. However, I'd like to merge those kinds of fixes one by one myself.
Optimization: merge UDDSketch using iterators

The previous implementation would create a UDDSketch (with a backing HashMap) for every possible merge, and then call `compact_buckets` on it to ensure the number of compactions in the target and the source were equal. Profiling a `rollup` call over a lot of data showed that `compact_buckets` was by far the main contributor to CPU time.

However, when merging another sketch into this sketch, we don't actually need to run `compact_buckets` every time: we can consume the source's keys and counts directly and apply `compact_key` calls as needed. This avoids a lot of heap allocations, since `compact_buckets` makes a full copy of the backing `HashMap` and then rebuilds it. For a particular workload, this reduced the execution time from 30 to 12 seconds.
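The idea can be sketched with a toy stand-in for the sketch type. The struct, field names, and key-halving scheme below are illustrative assumptions, not the toolkit's actual code: instead of rebuilding a compacted copy of the source, we consume its (key, count) pairs and compact each key on the fly before adding it to the target.

```rust
use std::collections::HashMap;

/// Toy stand-in for a UDDSketch: logarithmic bucket keys mapped to counts.
/// `compactions` tracks how many times the bucket space has been halved.
struct Sketch {
    compactions: u32,
    buckets: HashMap<i64, u64>,
}

impl Sketch {
    /// One compaction step for a single key (hypothetical scheme:
    /// halve the key's magnitude, preserving its sign).
    fn compact_key(key: i64) -> i64 {
        if key >= 0 { (key + 1) / 2 } else { -((-key + 1) / 2) }
    }

    /// Merge `other` into `self` without rebuilding either map:
    /// consume the source's (key, count) pairs by value and compact
    /// each key the extra number of times the target was compacted.
    fn merge(&mut self, other: Sketch) {
        assert!(
            self.compactions >= other.compactions,
            "toy example assumes the target is at least as compacted"
        );
        let extra = self.compactions - other.compactions;
        for (mut key, count) in other.buckets {
            for _ in 0..extra {
                key = Self::compact_key(key);
            }
            *self.buckets.entry(key).or_insert(0) += count;
        }
    }
}
```

Consuming `other.buckets` by value means no intermediate compacted copy of the source sketch is ever allocated; the only writes go straight into the target's map.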
Change implementation for entry()

Profiling showed that this function is quite a hotspot. By changing the implementation slightly, iterating directly over the values instead of walking the tree through the linked list, we improve the throughput of certain CPU-bound queries. We've seen the time needed for certain rollup queries drop by more than 50%.

Due to the way `entry()` was called, and because the borrow checker cannot let us hold two mutable references into a map, we were doing double lookups into the backing `HashMap` almost every time this function was called. The only callers of this function wanted to increment by either 1 or a given count, so we added a function supporting exactly that use case. It avoids the borrow-checker problem because it doesn't return a mutable reference: it does the work immediately.
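The do-the-work-inside pattern can be illustrated with the standard library's `HashMap`. The name `entry_upsert` matches the benchmark labels above, but the body is a sketch, not the toolkit's tree-backed map:

```rust
use std::collections::HashMap;

// Increment `key` by `count`, inserting the bucket if it is missing.
// The increment happens inside the function, so no mutable reference
// has to be returned to the caller, and the map is consulted only once
// per call instead of once for the lookup and once for the write.
fn entry_upsert(map: &mut HashMap<i64, u64>, key: i64, count: u64) {
    *map.entry(key).or_insert(0) += count;
}
```

A caller that previously fetched an entry and then wrote through the returned reference now makes a single call, which also sidesteps ever needing two simultaneous mutable borrows into the map.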
We got it slightly wrong previously: we used the number of values to reserve heap memory, but we actually want the number of buckets.
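A minimal sketch of the fix, assuming a helper that pre-sizes the backing map (the function name is hypothetical):

```rust
use std::collections::HashMap;

// Size the backing map by the number of distinct buckets, not by the
// number of input values: many values fall into the same bucket, so
// reserving one slot per value over-allocates heap memory.
fn sketch_map_for(num_buckets: usize) -> HashMap<i64, u64> {
    HashMap::with_capacity(num_buckets)
}
```

`with_capacity` guarantees room for at least that many entries before the map has to reallocate and rehash.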
Performance tweaks for uddsketch rollup.
When profiling some of our queries (using `perf` or Instruments), it was clear that there was a hot piece of code in `merge_sketch`; commit 5da4ab2 addresses this. After that, the `HashEntry` calls ranked quite high in the profile, and with commit f010516 we were able to reduce the runtime by roughly another half.

Dummy data set improvements
For a 600MB data set with 300k rows, grouped into 3100 buckets, fully prewarmed in `shared_buffers`, and with `work_mem` high enough to prevent on-disk sorts, these are some results:

`merge_sketch`
`entry()`
`entry_upsert()`

Some production data set, 1.5 million records, buckets = 1984:

`entry()`
`entry_upsert()`

Same production data set, 1.5 million records, buckets = 186277:

`entry()`
`entry_upsert()`