
Conversation

feikesteenbergen (Member) commented on Apr 9, 2025:

Performance tweaks for uddsketch rollup.

When profiling some of our queries (using perf or Instruments), it was
clear that there was a hot piece of code in merge_sketch; commit 5da4ab2
addresses this.

After addressing this, the HashEntry calls were quite high in the profile, and with commit f010516 we were able to reduce the runtime by roughly another half.

Dummy data set improvements

For a 600MB data set of 300k rows, grouped into 3100 buckets, fully
prewarmed in shared_buffers, and with work_mem high enough to prevent
on-disk sorts, these are some results:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 2713 kB | 215.248 ms | | |
| baseline (with rollup) | 8218825 kB | 5887.086 ms | 100.0 % | 100.0 % |
| tweak merge_sketch | 1286316 kB | 2157.163 ms | 15.7 % | 36.6 % |
| tweak entry() | 1209574 kB | 1143.912 ms | 14.7 % | 19.4 % |
| introduce entry_upsert() | 1285968 kB | 1032.128 ms | 15.6 % | 17.5 % |

Some production data set, 1.5 million records, buckets = 1984:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 26891 kB | 3206.245 ms | | |
| baseline (with rollup) | 62747020 kB | 66543.369 ms | 100.0 % | 100.0 % |
| tweak entry() | 23245638 kB | 15875.682 ms | 37.0 % | 23.9 % |
| introduce entry_upsert() | 23245642 kB | 13829.614 ms | 37.0 % | 20.8 % |

Same production data set, 1.5 million records, buckets = 186277:

| kind | allocations | time needed | mem pct | time pct |
| --- | --- | --- | --- | --- |
| baseline (no rollup)* | 26731096 kB | 7499.256 ms | | |
| baseline (with rollup) | 73056871 kB | 68607.422 ms | 100.0 % | 100.0 % |
| tweak entry() | 52589861 kB | 33231.835 ms | 72.0 % | 48.4 % |
| introduce entry_upsert() | 52589399 kB | 30150.045 ms | 72.0 % | 43.9 % |

*: Added here to show the theoretical lower limit of the time/memory needed

Optimization: merge UDDSketch using iterators

The previous implementation would create a UDDSketch (with a backing
HashMap) for every possible merge, and then call compact_buckets on it
to ensure the number of compactions in the target and the source was
equal.

Profiling this, we found that in a rollup call over a lot of data,
compact_buckets was by far the main contributor to CPU time.

However, when merging another sketch into this one, we don't need to
call compact_buckets every time: we can consume the keys and counts
directly and apply compact_key to each incoming key.

This prevents a lot of heap allocations, as compact_buckets makes a full
copy of the backing HashMap and then rebuilds it.

For a particular workload, this reduced the execution time from 30 to 12
seconds.
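As a rough illustration of the idea, here is a minimal sketch. The names, struct layout, and especially the halving behavior of compact_key are assumptions for the example, not the actual UDDSketch internals:

```rust
use std::collections::HashMap;

struct Sketch {
    compactions: u32,           // how many times this sketch has been compacted
    buckets: HashMap<i64, u64>, // bucket key -> count of values in the bucket
}

impl Sketch {
    // Assumed semantics for this example: each compaction halves the key
    // space, so a key from a less-compacted sketch maps to `key >> delta`.
    fn compact_key(key: i64, times: u32) -> i64 {
        key >> times
    }

    // Merge `other` into `self` by consuming its keys and counts directly,
    // compacting each key on the fly, instead of first building a compacted
    // copy of `other` the way a compact_buckets-based merge would.
    // (A full implementation would first compact `self` if `other` had
    // seen more compactions.)
    fn merge(&mut self, other: &Sketch) {
        assert!(self.compactions >= other.compactions);
        let delta = self.compactions - other.compactions;
        for (&key, &count) in &other.buckets {
            *self.buckets.entry(Self::compact_key(key, delta)).or_insert(0) += count;
        }
    }
}

fn main() {
    let mut target = Sketch { compactions: 2, buckets: HashMap::from([(4, 10)]) };
    let source = Sketch { compactions: 1, buckets: HashMap::from([(8, 3), (9, 5)]) };
    target.merge(&source);
    assert_eq!(target.buckets[&4], 18); // 8 >> 1 == 9 >> 1 == 4
}
```

The point is that no intermediate sketch is allocated: the source's buckets are folded straight into the target's map.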

Change implementation for entry()

Profiling showed that this function is quite the hotspot. By changing
the implementation slightly, iterating directly over the values instead
of walking the tree via the linked list, we improve the throughput of
certain CPU-bound queries.

We've seen a reduction in time needed of > 50% for certain rollup
queries.
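Schematically, the difference looks like this (hypothetical types; the toolkit's actual data structure differs, but the cache-locality argument is the same):

```rust
struct Entry {
    key: i64,
    count: u64,
    next: Option<usize>, // index of the next entry in the linked list
}

// Before (schematic): chase `next` links through the backing storage.
// Every hop is a data-dependent load, so the CPU cannot prefetch ahead.
fn find_linked(entries: &[Entry], head: Option<usize>, key: i64) -> Option<usize> {
    let mut cur = head;
    while let Some(i) = cur {
        if entries[i].key == key {
            return Some(i);
        }
        cur = entries[i].next;
    }
    None
}

// After (schematic): scan the contiguous values directly; sequential
// access over one allocation is far friendlier to caches and prefetching.
fn find_direct(entries: &[Entry], key: i64) -> Option<usize> {
    entries.iter().position(|e| e.key == key)
}

fn main() {
    let entries = vec![
        Entry { key: 7, count: 1, next: Some(1) },
        Entry { key: 9, count: 2, next: None },
    ];
    assert_eq!(find_linked(&entries, Some(0), 9), find_direct(&entries, 9));
}
```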

feikesteenbergen requested a review from dbeck on April 9, 2025 11:06
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 1bc5009 to 33315d1 on April 9, 2025 12:45
feikesteenbergen marked this pull request as ready for review on April 9, 2025 13:24
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 35ccb8b to ab7c9ec on April 10, 2025 11:29
sgichohi left a comment:
Awesome!

feikesteenbergen (Member, Author) commented:

There are probably similar improvements possible in other parts of the code. However, I'd like to merge those kinds of fixes one by one myself.

feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 24ad8de to 1bdeaf2 on April 10, 2025 15:07
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 1bdeaf2 to 84e9ee8 on April 11, 2025 08:14
feikesteenbergen changed the title from "Feike/reduce compact bucket calls" to "Performance improvements for UDDSketch" on Apr 11, 2025
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 11 times, most recently from 66f9ce1 to 2a97a28 on April 14, 2025 09:30
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 3 times, most recently from 6ebf6ec to f74585b on April 14, 2025 14:52

Introduce entry_upsert()

Due to the way entry() was called, and because the borrow checker cannot
let us hold two mutable references into a map, we were doing double
lookups into the backing HashMap almost every time this function was
called.

Looking at the code, however, the only callers of this function wanted
to either increment by 1 or by a given count.

Therefore, introduce a function that supports exactly that use case. It
doesn't run into the borrow-checker problem, as it doesn't have to
return a mutable reference: it does the work immediately.
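A minimal sketch of the pattern, using std's HashMap as a stand-in for the actual backing map (the names and layout here are assumptions):

```rust
use std::collections::HashMap;

struct Sketch {
    buckets: HashMap<i64, u64>,
}

impl Sketch {
    // Before (schematic): receiving a mutable reference from entry()
    // pushed callers into a pattern that looks the key up twice:
    //
    //   if let Some(count) = sketch.buckets.get_mut(&key) {
    //       *count += n;                   // first lookup
    //   } else {
    //       sketch.buckets.insert(key, n); // second lookup
    //   }
    //
    // After: do the increment inside the function. No mutable reference
    // escapes, so the borrow checker is satisfied and one lookup suffices.
    fn entry_upsert(&mut self, key: i64, count: u64) {
        *self.buckets.entry(key).or_insert(0) += count;
    }
}

fn main() {
    let mut s = Sketch { buckets: HashMap::new() };
    s.entry_upsert(42, 1); // increment by 1
    s.entry_upsert(42, 9); // increment by a count
    assert_eq!(s.buckets[&42], 10);
}
```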
We also got one detail slightly wrong previously: we used the number of
values to reserve heap memory, but we actually want the number of
buckets.
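As a hypothetical illustration (the function and variable names here are made up):

```rust
// Size the reservation by the number of buckets, not by the (potentially
// much larger) total number of values stored across those buckets.
fn reserve_for_buckets(keys: &mut Vec<i64>, num_buckets: usize) {
    keys.reserve(num_buckets); // previously (wrong): keys.reserve(num_values)
}

fn main() {
    let mut keys: Vec<i64> = Vec::new();
    reserve_for_buckets(&mut keys, 3_100); // e.g. 3100 buckets, not 300k values
    assert!(keys.capacity() >= 3_100);
}
```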
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch 4 times, most recently from 1920efe to 2564a82 on April 15, 2025 10:18
feikesteenbergen force-pushed the feike/reduce_compact_bucket_calls branch from 2564a82 to 8d007c2 on April 15, 2025 10:22
feikesteenbergen merged commit df79383 into main on Apr 15, 2025
13 checks passed