Cache locality improvement for deflate State.lookahead #372
Codecov Report: All modified and coverable lines are covered by tests ✅
... and 3 files with indirect coverage changes.
This change sacrifices the precomputed [...] This improves the cycle count for a few compression levels on my Intel x86_64 test system. It increases the instruction count at compression level 2, but that doesn't result in an increase in cycles, possibly because the CPU is able to schedule the needed subtraction operation in an otherwise unused slot. I anticipate that the performance of this PR will vary among CPU types. Before/after:
zlib-rs/src/deflate/hash_calc.rs
- state.prev.as_mut_slice()[idx as usize & state.w_mask] = head;
+ state.prev.as_mut_slice()[idx as usize & state.w_mask()] = head;
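As a rough illustration of the trade-off in this diff, the sketch below swaps a stored mask field for a method that derives the mask on demand. The `State` struct, the `link` helper, and `demo_link` are hypothetical stand-ins, not the actual zlib-rs definitions, and the sketch assumes the mask is simply the window size minus one, with a power-of-two window size.

```rust
/// Hypothetical, simplified stand-in for the deflate state (not the zlib-rs type).
struct State {
    /// Window size in bytes; assumed to be a power of two.
    w_size: usize,
    /// Previous-position chain, one entry per window slot.
    prev: Vec<u16>,
}

impl State {
    /// Derive the mask on demand instead of reading a stored field.
    /// This costs one subtraction per call and keeps the struct one field smaller.
    #[inline(always)]
    fn w_mask(&self) -> usize {
        self.w_size - 1
    }

    /// Record `head` as the predecessor of position `idx`, wrapping into the window.
    fn link(&mut self, idx: usize, head: u16) {
        let mask = self.w_mask();
        self.prev[idx & mask] = head;
    }
}

/// Small usage example: with an 8-entry window, position 13 wraps to slot 5.
fn demo_link() -> u16 {
    let mut state = State { w_size: 8, prev: vec![0; 8] };
    state.link(13, 42);
    state.prev[5]
}
```

Whether the extra subtraction is cheaper than the memory footprint of a stored field probably depends on the CPU, so this is something to measure rather than assume.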
I see some further improvements for level 2 when moving the state.w_mask() call out of the loop. Did/can you try that?
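A minimal sketch of the hoisting suggested here, under the same assumptions as the diff (the mask is derived from the window size). The `State` struct and the `insert_range` function are hypothetical simplifications for illustration, not the zlib-rs code.

```rust
/// Hypothetical, simplified stand-in for the deflate state (not the zlib-rs type).
struct State {
    /// Window size; assumed to be a power of two.
    w_size: usize,
    /// Previous-position chain, indexed by position masked into the window.
    prev: Vec<u16>,
    /// Most recent position seen for each hash bucket.
    head: Vec<u16>,
}

impl State {
    /// Mask derived on demand from the window size.
    fn w_mask(&self) -> usize {
        self.w_size - 1
    }
}

/// Insert consecutive positions starting at `start`, one per precomputed hash.
/// The mask is computed once before the loop instead of on every iteration.
fn insert_range(state: &mut State, hashes: &[usize], start: usize) {
    let w_mask = state.w_mask(); // hoisted out of the loop
    for (i, &hash) in hashes.iter().enumerate() {
        let idx = start + i;
        let head = state.head[hash];
        // Chain the old head behind the new position, then update the bucket.
        state.prev[idx & w_mask] = head;
        state.head[hash] = idx as u16;
    }
}
```

The optimizer may already hoist the computation on its own in some cases, so the effect of doing it by hand is best checked with a benchmark at each compression level.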
Similarly for the line above.
With the update I just pushed, moving state.w_mask() out of the loop doesn't help much at level 2 on my system, but it produces some improvements at higher compression levels:
Benchmark 1 (68 runs): ./blogpost-compress-baseline 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 73.5ms ± 776us 72.5ms … 77.5ms 1 ( 1%) 0%
peak_rss 26.6MB ± 74.0KB 26.5MB … 26.8MB 0 ( 0%) 0%
cpu_cycles 283M ± 743K 282M … 286M 2 ( 3%) 0%
instructions 544M ± 285 544M … 544M 0 ( 0%) 0%
cache_references 267K ± 18.3K 261K … 400K 7 (10%) 0%
cache_misses 228K ± 9.18K 202K … 238K 8 (12%) 0%
branch_misses 2.91M ± 6.68K 2.90M … 2.94M 1 ( 1%) 0%
Benchmark 2 (69 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 72.8ms ± 841us 71.8ms … 78.8ms 2 ( 3%) - 1.0% ± 0.4%
peak_rss 26.6MB ± 70.6KB 26.3MB … 26.8MB 1 ( 1%) - 0.1% ± 0.1%
cpu_cycles 281M ± 3.12M 279M … 306M 2 ( 3%) - 0.6% ± 0.3%
instructions 549M ± 289 549M … 549M 0 ( 0%) 💩+ 1.1% ± 0.0%
cache_references 265K ± 5.71K 261K … 301K 6 ( 9%) - 0.9% ± 1.7%
cache_misses 230K ± 6.91K 197K … 238K 4 ( 6%) + 0.7% ± 1.2%
branch_misses 2.89M ± 6.34K 2.88M … 2.91M 0 ( 0%) - 0.7% ± 0.1%
Benchmark 1 (42 runs): ./blogpost-compress-baseline 2 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 121ms ± 5.79ms 119ms … 157ms 1 ( 2%) 0%
peak_rss 24.9MB ± 75.8KB 24.8MB … 25.1MB 0 ( 0%) 0%
cpu_cycles 493M ± 23.8M 487M … 643M 1 ( 2%) 0%
instructions 1.07G ± 348 1.07G … 1.07G 2 ( 5%) 0%
cache_references 269K ± 11.6K 264K … 332K 4 (10%) 0%
cache_misses 229K ± 9.71K 200K … 239K 7 (17%) 0%
branch_misses 6.19M ± 8.99K 6.17M … 6.21M 2 ( 5%) 0%
Benchmark 2 (42 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 120ms ± 1.20ms 119ms … 124ms 1 ( 2%) - 0.8% ± 1.5%
peak_rss 24.9MB ± 66.1KB 24.8MB … 25.0MB 0 ( 0%) + 0.1% ± 0.1%
cpu_cycles 487M ± 3.17M 484M … 503M 4 (10%) - 1.1% ± 1.5%
instructions 1.08G ± 288 1.08G … 1.08G 0 ( 0%) 💩+ 1.0% ± 0.0%
cache_references 271K ± 15.9K 264K … 349K 6 (14%) + 0.6% ± 2.2%
cache_misses 230K ± 10.1K 206K … 246K 9 (21%) + 0.7% ± 1.9%
branch_misses 6.19M ± 4.53K 6.19M … 6.21M 0 ( 0%) + 0.1% ± 0.0%
Benchmark 1 (37 runs): ./blogpost-compress-baseline 3 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 138ms ± 2.21ms 137ms … 148ms 4 (11%) 0%
peak_rss 24.7MB ± 83.3KB 24.6MB … 24.8MB 0 ( 0%) 0%
cpu_cycles 566M ± 3.40M 564M … 581M 2 ( 5%) 0%
instructions 1.40G ± 301 1.40G … 1.40G 0 ( 0%) 0%
cache_references 271K ± 10.9K 265K … 314K 4 (11%) 0%
cache_misses 234K ± 5.31K 215K … 242K 2 ( 5%) 0%
branch_misses 7.04M ± 4.43K 7.04M … 7.05M 0 ( 0%) 0%
Benchmark 2 (37 runs): ./target/release/examples/blogpost-compress 3 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 137ms ± 867us 136ms … 140ms 2 ( 5%) - 0.7% ± 0.6%
peak_rss 24.6MB ± 60.8KB 24.6MB … 24.8MB 11 (30%) - 0.2% ± 0.1%
cpu_cycles 563M ± 2.51M 562M … 575M 2 ( 5%) - 0.5% ± 0.2%
instructions 1.41G ± 345 1.41G … 1.41G 0 ( 0%) + 0.8% ± 0.0%
cache_references 269K ± 5.48K 265K … 291K 2 ( 5%) - 0.5% ± 1.5%
cache_misses 232K ± 7.46K 207K … 243K 6 (16%) - 0.8% ± 1.3%
branch_misses 7.07M ± 4.28K 7.06M … 7.08M 0 ( 0%) + 0.3% ± 0.0%
Benchmark 1 (32 runs): ./blogpost-compress-baseline 4 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 161ms ± 563us 160ms … 163ms 1 ( 3%) 0%
peak_rss 24.5MB ± 81.9KB 24.3MB … 24.7MB 1 ( 3%) 0%
cpu_cycles 667M ± 791K 666M … 669M 0 ( 0%) 0%
instructions 1.50G ± 339 1.50G … 1.50G 0 ( 0%) 0%
cache_references 274K ± 15.2K 264K … 348K 2 ( 6%) 0%
cache_misses 232K ± 7.54K 207K … 240K 3 ( 9%) 0%
branch_misses 7.56M ± 4.19K 7.56M … 7.57M 0 ( 0%) 0%
Benchmark 2 (32 runs): ./target/release/examples/blogpost-compress 4 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 160ms ± 978us 159ms … 163ms 2 ( 6%) - 0.6% ± 0.2%
peak_rss 24.5MB ± 58.0KB 24.4MB … 24.7MB 0 ( 0%) + 0.0% ± 0.1%
cpu_cycles 663M ± 3.79M 660M … 677M 3 ( 9%) - 0.6% ± 0.2%
instructions 1.51G ± 363 1.51G … 1.51G 0 ( 0%) + 0.8% ± 0.0%
cache_references 275K ± 20.3K 264K … 371K 3 ( 9%) + 0.6% ± 3.3%
cache_misses 232K ± 7.68K 207K … 240K 4 (13%) - 0.0% ± 1.6%
branch_misses 7.57M ± 6.14K 7.56M … 7.58M 0 ( 0%) + 0.1% ± 0.0%
Benchmark 1 (28 runs): ./blogpost-compress-baseline 5 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 180ms ± 1.84ms 179ms … 189ms 2 ( 7%) 0%
peak_rss 24.5MB ± 89.1KB 24.3MB … 24.7MB 2 ( 7%) 0%
cpu_cycles 751M ± 6.22M 749M … 782M 5 (18%) 0%
instructions 1.72G ± 319 1.72G … 1.72G 1 ( 4%) 0%
cache_references 279K ± 51.0K 265K … 537K 3 (11%) 0%
cache_misses 231K ± 9.08K 205K … 240K 4 (14%) 0%
branch_misses 8.25M ± 7.67K 8.24M … 8.27M 0 ( 0%) 0%
Benchmark 2 (29 runs): ./target/release/examples/blogpost-compress 5 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 178ms ± 2.26ms 177ms … 188ms 2 ( 7%) - 0.8% ± 0.6%
peak_rss 24.5MB ± 69.1KB 24.3MB … 24.6MB 1 ( 3%) - 0.1% ± 0.2%
cpu_cycles 741M ± 891K 740M … 743M 0 ( 0%) - 1.3% ± 0.3%
instructions 1.73G ± 334 1.73G … 1.73G 0 ( 0%) + 0.7% ± 0.0%
cache_references 271K ± 10.7K 265K … 321K 3 (10%) - 2.8% ± 7.0%
cache_misses 233K ± 9.67K 209K … 250K 8 (28%) + 0.8% ± 2.2%
branch_misses 8.22M ± 7.87K 8.20M … 8.24M 3 (10%) - 0.4% ± 0.1%
Benchmark 1 (23 runs): ./blogpost-compress-baseline 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 221ms ± 882us 219ms … 223ms 0 ( 0%) 0%
peak_rss 24.5MB ± 97.1KB 24.4MB … 24.7MB 0 ( 0%) 0%
cpu_cycles 926M ± 1.16M 925M … 929M 1 ( 4%) 0%
instructions 1.89G ± 300 1.89G … 1.89G 0 ( 0%) 0%
cache_references 287K ± 76.4K 266K … 637K 1 ( 4%) 0%
cache_misses 234K ± 6.08K 217K … 242K 2 ( 9%) 0%
branch_misses 8.42M ± 8.76K 8.40M … 8.43M 0 ( 0%) 0%
Benchmark 2 (23 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 219ms ± 752us 218ms … 221ms 0 ( 0%) - 1.1% ± 0.2%
peak_rss 24.5MB ± 74.7KB 24.4MB … 24.7MB 0 ( 0%) + 0.0% ± 0.2%
cpu_cycles 915M ± 785K 914M … 917M 0 ( 0%) ⚡- 1.2% ± 0.1%
instructions 1.90G ± 699 1.90G … 1.90G 1 ( 4%) + 0.6% ± 0.0%
cache_references 272K ± 7.03K 266K … 293K 2 ( 9%) - 5.1% ± 11.3%
cache_misses 236K ± 2.95K 229K … 242K 0 ( 0%) + 0.7% ± 1.2%
branch_misses 8.40M ± 7.02K 8.39M … 8.41M 0 ( 0%) - 0.2% ± 0.1%
Benchmark 1 (17 runs): ./blogpost-compress-baseline 7 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 306ms ± 1.01ms 305ms … 309ms 1 ( 6%) 0%
peak_rss 24.4MB ± 83.9KB 24.2MB … 24.5MB 0 ( 0%) 0%
cpu_cycles 1.28G ± 1.64M 1.28G … 1.29G 0 ( 0%) 0%
instructions 2.28G ± 328 2.28G … 2.28G 0 ( 0%) 0%
cache_references 272K ± 6.44K 266K … 295K 4 (24%) 0%
cache_misses 235K ± 5.73K 217K … 239K 2 (12%) 0%
branch_misses 9.64M ± 6.96K 9.63M … 9.65M 0 ( 0%) 0%
Benchmark 2 (17 runs): ./target/release/examples/blogpost-compress 7 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 301ms ± 4.07ms 299ms … 316ms 1 ( 6%) - 1.6% ± 0.7%
peak_rss 24.4MB ± 58.7KB 24.3MB … 24.5MB 0 ( 0%) - 0.1% ± 0.2%
cpu_cycles 1.26G ± 6.00M 1.26G … 1.28G 2 (12%) ⚡- 1.9% ± 0.2%
instructions 2.30G ± 335 2.30G … 2.30G 2 (12%) + 0.6% ± 0.0%
cache_references 289K ± 41.7K 267K … 404K 2 (12%) + 5.9% ± 7.7%
cache_misses 238K ± 6.50K 222K … 251K 3 (18%) + 1.2% ± 1.8%
branch_misses 9.58M ± 14.8K 9.54M … 9.60M 0 ( 0%) - 0.7% ± 0.1%
Benchmark 1 (13 runs): ./blogpost-compress-baseline 8 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 400ms ± 2.21ms 394ms … 404ms 3 (23%) 0%
peak_rss 24.4MB ± 76.2KB 24.3MB … 24.5MB 0 ( 0%) 0%
cpu_cycles 1.69G ± 954K 1.69G … 1.69G 0 ( 0%) 0%
instructions 2.73G ± 300 2.73G … 2.73G 0 ( 0%) 0%
cache_references 273K ± 5.98K 265K … 286K 0 ( 0%) 0%
cache_misses 235K ± 4.99K 223K … 242K 2 (15%) 0%
branch_misses 9.77M ± 7.98K 9.76M … 9.79M 0 ( 0%) 0%
Benchmark 2 (13 runs): ./target/release/examples/blogpost-compress 8 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 394ms ± 503us 393ms … 395ms 0 ( 0%) ⚡- 1.6% ± 0.3%
peak_rss 24.4MB ± 71.8KB 24.2MB … 24.5MB 1 ( 8%) + 0.0% ± 0.2%
cpu_cycles 1.66G ± 1.34M 1.66G … 1.66G 0 ( 0%) ⚡- 1.7% ± 0.1%
instructions 2.75G ± 164 2.75G … 2.75G 0 ( 0%) + 0.5% ± 0.0%
cache_references 277K ± 16.3K 266K … 318K 0 ( 0%) + 1.7% ± 3.6%
cache_misses 237K ± 4.41K 229K … 245K 1 ( 8%) + 0.7% ± 1.6%
branch_misses 9.70M ± 13.0K 9.68M … 9.73M 0 ( 0%) - 0.7% ± 0.1%
Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 419ms ± 935us 418ms … 421ms 2 (17%) 0%
peak_rss 24.4MB ± 49.4KB 24.2MB … 24.4MB 4 (33%) 0%
cpu_cycles 1.76G ± 1.51M 1.76G … 1.77G 0 ( 0%) 0%
instructions 3.35G ± 313 3.35G … 3.35G 1 ( 8%) 0%
cache_references 279K ± 24.6K 266K … 355K 1 ( 8%) 0%
cache_misses 235K ± 7.01K 221K … 242K 1 ( 8%) 0%
branch_misses 15.4M ± 17.0K 15.4M … 15.5M 0 ( 0%) 0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 422ms ± 1.51ms 420ms … 425ms 0 ( 0%) + 0.7% ± 0.3%
peak_rss 24.4MB ± 97.2KB 24.2MB … 24.6MB 0 ( 0%) + 0.1% ± 0.3%
cpu_cycles 1.77G ± 2.27M 1.77G … 1.78G 0 ( 0%) + 0.4% ± 0.1%
instructions 3.38G ± 375 3.38G … 3.38G 0 ( 0%) + 1.0% ± 0.0%
cache_references 278K ± 8.24K 268K … 296K 0 ( 0%) - 0.1% ± 5.6%
cache_misses 238K ± 4.10K 230K … 245K 0 ( 0%) + 1.5% ± 2.1%
branch_misses 15.8M ± 40.1K 15.7M … 15.9M 0 ( 0%) 💩+ 2.2% ± 0.2%
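For reading the delta column, here is a small sketch of the relative-change arithmetic applied to the rounded means printed above. The function names are illustrative, and the benchmark tool presumably works from unrounded samples and also derives the ± interval, which this sketch does not attempt.

```rust
/// Relative change of `new` versus `baseline`, in percent.
/// Negative values mean the new build used less of the measured resource.
fn delta_percent(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

/// Level 6 cpu_cycles, using the rounded means from the output above:
/// 926M cycles for the baseline and 915M for the PR build.
fn level6_cycles_delta() -> f64 {
    delta_percent(926.0e6, 915.0e6)
}
```

With those rounded inputs the result is roughly -1.19, in line with the - 1.2% reported for level 6.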
Looks good, thanks!