Cache locality improvement for deflate State.lookahead #372

brian-pane · 2025-05-28T00:59:25Z

No description provided.

codecov · 2025-05-28T01:00:45Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag	Coverage Δ
fuzz-compress	`?`
fuzz-decompress	`?`
test-aarch64-apple-darwin	`93.37% <100.00%> (+<0.01%)`	⬆️
test-x86_64-apple-darwin	`91.70% <100.00%> (-0.01%)`	⬇️
test-x86_64-unknown-linux-gnu	`90.46% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
zlib-rs/src/deflate.rs	`97.12% <100.00%> (+0.09%)`	⬆️
zlib-rs/src/deflate/hash_calc.rs	`100.00% <100.00%> (ø)`
zlib-rs/src/deflate/longest_match.rs	`95.59% <100.00%> (ø)`

... and 3 files with indirect coverage changes

brian-pane · 2025-05-28T01:14:16Z

This change drops the precomputed w_mask field and computes the mask on demand instead, which adds a subtraction instruction to the performance-critical quick_insert_string operation. In exchange, the lookahead field (which is also used in performance-critical loops) moves into the same cache line as the other frequently used fields.
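
As a rough illustration of the trade-off described above, here is a minimal sketch that assumes the mask is derived as `w_size - 1` and that the window size is a power of two. The `State` struct below is a simplified, hypothetical stand-in, not the actual zlib-rs definition.

```rust
/// Hypothetical, simplified stand-in for the deflate state (not the real zlib-rs struct).
#[allow(dead_code)]
struct State {
    /// Window size in bytes; assumed here to be a power of two.
    w_size: usize,
    /// Number of valid input bytes ahead of the current position.
    lookahead: usize,
}

impl State {
    /// Derive the window mask on demand instead of storing it as a field.
    /// This costs one subtraction per call (assumption: mask == w_size - 1).
    #[inline(always)]
    fn w_mask(&self) -> usize {
        self.w_size - 1
    }
}

fn main() {
    let state = State { w_size: 1 << 15, lookahead: 0 };
    let idx = 40_000usize;
    // For a power-of-two window, masking is equivalent to taking the remainder.
    assert_eq!(idx & state.w_mask(), idx % state.w_size);
    println!("mask = {:#x}, lookahead = {}", state.w_mask(), state.lookahead);
}
```

Removing the stored mask frees one field's worth of space in the struct, which is what makes room to place another hot field next to the frequently used ones.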

This improves cycle count for a few compression levels on my Intel x86_64 test system. It creates an increase in instructions at compression level 2, but that doesn't result in an increase in cycles -- possibly because the CPU is able to schedule the needed subtraction operation in an otherwise unused slot?

I anticipate that the performance of this PR will vary among CPU types.
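
One factor that can differ between CPUs is the cache line size, so whether two fields share a line depends on both their offsets and the hardware. The sketch below shows one way to check this with `std::mem::offset_of!` (stable since Rust 1.77). The struct layout and every field name other than `lookahead` and `w_size` are hypothetical, and the 64-byte line size is an assumption.

```rust
use std::mem::offset_of;

/// Hypothetical layout for illustration only; this is not the zlib-rs struct.
#[allow(dead_code)]
#[repr(C)]
struct State {
    hot_field: usize,     // made-up frequently used field
    w_size: usize,
    lookahead: usize,
    cold_data: [u8; 256], // made-up rarely touched data
    rarely_used: usize,   // made-up rarely used field
}

/// Index of the cache line (of `line_size` bytes) containing byte `offset`,
/// assuming the struct itself starts on a cache line boundary.
fn cache_line_of(offset: usize, line_size: usize) -> usize {
    offset / line_size
}

fn main() {
    // Assumed line size; 64 bytes is common on x86_64 but not universal.
    const LINE: usize = 64;
    let hot = cache_line_of(offset_of!(State, hot_field), LINE);
    let look = cache_line_of(offset_of!(State, lookahead), LINE);
    let cold = cache_line_of(offset_of!(State, rarely_used), LINE);
    println!("hot_field: line {hot}, lookahead: line {look}, rarely_used: line {cold}");
}
```

Changing `LINE` to another value (for example 128) shows how the same layout maps onto a different line size.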

Before/after:

Benchmark 1 (69 runs): ./blogpost-compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          72.8ms ±  694us    72.1ms … 77.2ms          7 (10%)        0%
  peak_rss           26.6MB ± 65.0KB    26.3MB … 26.7MB          1 ( 1%)        0%
  cpu_cycles          283M  ±  891K      281M  …  288M           2 ( 3%)        0%
  instructions        544M  ±  268       544M  …  544M           0 ( 0%)        0%
  cache_references    263K  ± 5.70K      260K  …  303K           6 ( 9%)        0%
  cache_misses        230K  ± 7.98K      197K  …  237K           5 ( 7%)        0%
  branch_misses      2.91M  ± 6.15K     2.90M  … 2.93M           1 ( 1%)        0%
Benchmark 2 (70 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          72.1ms ±  426us    71.3ms … 73.1ms          0 ( 0%)          -  1.0% ±  0.3%
  peak_rss           26.6MB ± 59.3KB    26.5MB … 26.8MB          1 ( 1%)          +  0.0% ±  0.1%
  cpu_cycles          279M  ±  608K      278M  …  281M           0 ( 0%)        ⚡-  1.3% ±  0.1%
  instructions        549M  ±  293       549M  …  549M           0 ( 0%)        💩+  1.1% ±  0.0%
  cache_references    263K  ± 3.05K      261K  …  285K           6 ( 9%)          -  0.0% ±  0.6%
  cache_misses        231K  ± 6.85K      200K  …  240K           5 ( 7%)          +  0.5% ±  1.1%
  branch_misses      2.85M  ± 5.40K     2.84M  … 2.86M           0 ( 0%)        ⚡-  2.2% ±  0.1%
Benchmark 1 (42 runs): ./blogpost-compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ± 1.03ms     119ms …  126ms          1 ( 2%)        0%
  peak_rss           24.9MB ± 58.7KB    24.8MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          489M  ± 1.33M      487M  …  494M           1 ( 2%)        0%
  instructions       1.07G  ±  382      1.07G  … 1.07G           2 ( 5%)        0%
  cache_references    268K  ± 3.50K      264K  …  280K           1 ( 2%)        0%
  cache_misses        232K  ± 9.38K      201K  …  246K           5 (12%)        0%
  branch_misses      6.19M  ± 7.85K     6.18M  … 6.21M           3 ( 7%)        0%
Benchmark 2 (42 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           119ms ±  720us     118ms …  122ms          1 ( 2%)          -  0.6% ±  0.3%
  peak_rss           24.9MB ± 53.7KB    24.8MB … 25.0MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          486M  ± 1.08M      484M  …  488M           0 ( 0%)          -  0.7% ±  0.1%
  instructions       1.08G  ±  326      1.08G  … 1.08G           0 ( 0%)        💩+  1.0% ±  0.0%
  cache_references    272K  ± 19.2K      264K  …  385K           5 (12%)          +  1.8% ±  2.2%
  cache_misses        231K  ± 9.54K      201K  …  244K           7 (17%)          -  0.3% ±  1.8%
  branch_misses      6.20M  ± 4.94K     6.20M  … 6.22M           0 ( 0%)          +  0.2% ±  0.0%
Benchmark 1 (37 runs): ./blogpost-compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           138ms ± 1.68ms     136ms …  146ms          3 ( 8%)        0%
  peak_rss           24.7MB ± 88.6KB    24.5MB … 24.8MB          0 ( 0%)        0%
  cpu_cycles          567M  ± 5.26M      564M  …  590M           3 ( 8%)        0%
  instructions       1.40G  ±  352      1.40G  … 1.40G           0 ( 0%)        0%
  cache_references    270K  ± 8.04K      265K  …  311K           4 (11%)        0%
  cache_misses        234K  ± 8.03K      210K  …  240K           4 (11%)        0%
  branch_misses      7.05M  ± 5.27K     7.04M  … 7.06M           0 ( 0%)        0%
Benchmark 2 (37 runs): ./target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           136ms ±  914us     135ms …  141ms          1 ( 3%)          -  1.1% ±  0.5%
  peak_rss           24.7MB ±  105KB    24.5MB … 24.8MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          561M  ± 3.48M      558M  …  581M           1 ( 3%)          -  1.0% ±  0.4%
  instructions       1.41G  ±  309      1.41G  … 1.41G           0 ( 0%)          +  0.8% ±  0.0%
  cache_references    281K  ± 67.1K      265K  …  677K           3 ( 8%)          +  4.1% ±  8.2%
  cache_misses        235K  ± 7.60K      210K  …  251K           4 (11%)          +  0.2% ±  1.5%
  branch_misses      7.08M  ± 7.30K     7.07M  … 7.10M           0 ( 0%)          +  0.5% ±  0.0%
Benchmark 1 (32 runs): ./blogpost-compress-baseline 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           161ms ±  804us     160ms …  163ms          0 ( 0%)        0%
  peak_rss           24.5MB ±  104KB    24.3MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          668M  ±  876K      666M  …  669M           0 ( 0%)        0%
  instructions       1.50G  ±  320      1.50G  … 1.50G           0 ( 0%)        0%
  cache_references    270K  ± 3.03K      265K  …  278K           1 ( 3%)        0%
  cache_misses        234K  ± 9.02K      208K  …  242K           5 (16%)        0%
  branch_misses      7.57M  ± 6.84K     7.56M  … 7.59M           1 ( 3%)        0%
Benchmark 2 (32 runs): ./target/release/examples/blogpost-compress 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           159ms ±  782us     158ms …  161ms          2 ( 6%)          -  1.2% ±  0.2%
  peak_rss           24.5MB ±  107KB    24.4MB … 24.7MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          660M  ± 1.05M      658M  …  662M           0 ( 0%)        ⚡-  1.2% ±  0.1%
  instructions       1.51G  ±  266      1.51G  … 1.51G           0 ( 0%)          +  0.8% ±  0.0%
  cache_references    270K  ± 4.58K      265K  …  285K           2 ( 6%)          +  0.1% ±  0.7%
  cache_misses        233K  ± 8.12K      211K  …  240K           6 (19%)          -  0.4% ±  1.8%
  branch_misses      7.61M  ± 6.28K     7.60M  … 7.62M           0 ( 0%)          +  0.6% ±  0.0%
Benchmark 1 (28 runs): ./blogpost-compress-baseline 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           180ms ±  761us     179ms …  182ms          2 ( 7%)        0%
  peak_rss           24.5MB ±  114KB    24.3MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          750M  ± 1.42M      749M  …  757M           1 ( 4%)        0%
  instructions       1.72G  ±  252      1.72G  … 1.72G           0 ( 0%)        0%
  cache_references    272K  ± 7.60K      265K  …  297K           5 (18%)        0%
  cache_misses        237K  ± 1.68K      233K  …  241K           0 ( 0%)        0%
  branch_misses      8.26M  ± 6.16K     8.24M  … 8.27M           1 ( 4%)        0%
Benchmark 2 (29 runs): ./target/release/examples/blogpost-compress 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           177ms ± 1.10ms     177ms …  182ms          1 ( 3%)          -  1.2% ±  0.3%
  peak_rss           24.5MB ±  113KB    24.3MB … 24.7MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          740M  ±  864K      738M  …  742M           1 ( 3%)        ⚡-  1.4% ±  0.1%
  instructions       1.73G  ±  255      1.73G  … 1.73G           0 ( 0%)          +  0.7% ±  0.0%
  cache_references    273K  ± 10.2K      265K  …  310K           3 (10%)          +  0.2% ±  1.8%
  cache_misses        235K  ± 6.40K      213K  …  246K           3 (10%)          -  0.6% ±  1.1%
  branch_misses      8.28M  ± 16.9K     8.26M  … 8.33M           0 ( 0%)          +  0.3% ±  0.1%
Benchmark 1 (23 runs): ./blogpost-compress-baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           221ms ±  851us     220ms …  224ms          1 ( 4%)        0%
  peak_rss           24.5MB ±  123KB    24.3MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          926M  ±  815K      925M  …  927M           0 ( 0%)        0%
  instructions       1.89G  ±  270      1.89G  … 1.89G           0 ( 0%)        0%
  cache_references    270K  ± 4.36K      266K  …  285K           1 ( 4%)        0%
  cache_misses        235K  ± 7.81K      213K  …  240K           3 (13%)        0%
  branch_misses      8.42M  ± 6.52K     8.41M  … 8.43M           0 ( 0%)        0%
Benchmark 2 (23 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           218ms ± 1.08ms     216ms …  221ms          0 ( 0%)          -  1.1% ±  0.3%
  peak_rss           24.5MB ±  122KB    24.3MB … 24.7MB          0 ( 0%)          +  0.1% ±  0.3%
  cpu_cycles          914M  ± 2.00M      912M  …  920M           2 ( 9%)        ⚡-  1.3% ±  0.1%
  instructions       1.90G  ±  313      1.90G  … 1.90G           0 ( 0%)          +  0.6% ±  0.0%
  cache_references    276K  ± 12.6K      266K  …  315K           2 ( 9%)          +  2.1% ±  2.1%
  cache_misses        235K  ± 8.75K      211K  …  245K           5 (22%)          -  0.0% ±  2.1%
  branch_misses      8.43M  ± 9.55K     8.41M  … 8.46M           1 ( 4%)          +  0.1% ±  0.1%
Benchmark 1 (17 runs): ./blogpost-compress-baseline 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           305ms ±  601us     304ms …  307ms          0 ( 0%)        0%
  peak_rss           24.3MB ± 52.3KB    24.2MB … 24.4MB          0 ( 0%)        0%
  cpu_cycles         1.28G  ±  729K     1.28G  … 1.29G           1 ( 6%)        0%
  instructions       2.28G  ±  276      2.28G  … 2.28G           1 ( 6%)        0%
  cache_references    275K  ± 10.7K      268K  …  308K           1 ( 6%)        0%
  cache_misses        236K  ± 5.69K      216K  …  241K           1 ( 6%)        0%
  branch_misses      9.65M  ± 8.90K     9.63M  … 9.67M           0 ( 0%)        0%
Benchmark 2 (17 runs): ./target/release/examples/blogpost-compress 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           302ms ±  861us     301ms …  305ms          1 ( 6%)          -  1.0% ±  0.2%
  peak_rss           24.3MB ± 48.3KB    24.3MB … 24.4MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles         1.27G  ± 1.37M     1.27G  … 1.27G           0 ( 0%)          -  1.0% ±  0.1%
  instructions       2.30G  ±  328      2.30G  … 2.30G           0 ( 0%)          +  0.6% ±  0.0%
  cache_references    278K  ± 28.1K      266K  …  383K           2 (12%)          +  1.0% ±  5.4%
  cache_misses        238K  ± 4.23K      222K  …  241K           1 ( 6%)          +  0.8% ±  1.5%
  branch_misses      9.67M  ± 10.6K     9.65M  … 9.69M           0 ( 0%)          +  0.2% ±  0.1%
Benchmark 1 (13 runs): ./blogpost-compress-baseline 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           401ms ± 1.13ms     400ms …  404ms          1 ( 8%)        0%
  peak_rss           24.3MB ± 54.9KB    24.3MB … 24.4MB          0 ( 0%)        0%
  cpu_cycles         1.69G  ± 1.17M     1.69G  … 1.69G           0 ( 0%)        0%
  instructions       2.73G  ±  187      2.73G  … 2.73G           1 ( 8%)        0%
  cache_references    271K  ± 6.13K      267K  …  290K           1 ( 8%)        0%
  cache_misses        237K  ± 5.91K      218K  …  242K           1 ( 8%)        0%
  branch_misses      9.77M  ± 5.90K     9.76M  … 9.79M           0 ( 0%)        0%
Benchmark 2 (13 runs): ./target/release/examples/blogpost-compress 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           397ms ±  722us     396ms …  398ms          0 ( 0%)          -  0.9% ±  0.2%
  peak_rss           24.4MB ± 66.6KB    24.2MB … 24.5MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles         1.67G  ± 1.69M     1.67G  … 1.68G           1 ( 8%)          -  0.9% ±  0.1%
  instructions       2.75G  ±  205      2.75G  … 2.75G           0 ( 0%)          +  0.5% ±  0.0%
  cache_references    271K  ± 3.44K      267K  …  277K           0 ( 0%)          +  0.3% ±  1.5%
  cache_misses        238K  ± 5.16K      222K  …  242K           1 ( 8%)          +  0.3% ±  1.9%
  branch_misses      9.80M  ± 13.2K     9.78M  … 9.82M           0 ( 0%)          +  0.3% ±  0.1%
Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           418ms ± 1.13ms     417ms …  420ms          0 ( 0%)        0%
  peak_rss           24.3MB ± 81.5KB    24.2MB … 24.4MB          1 ( 8%)        0%
  cpu_cycles         1.76G  ± 2.62M     1.76G  … 1.77G           0 ( 0%)        0%
  instructions       3.35G  ±  330      3.35G  … 3.35G           0 ( 0%)        0%
  cache_references    283K  ± 28.6K      266K  …  371K           1 ( 8%)        0%
  cache_misses        237K  ± 7.68K      220K  …  243K           3 (25%)        0%
  branch_misses      15.4M  ± 20.6K     15.4M  … 15.5M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           419ms ± 1.33ms     417ms …  421ms          0 ( 0%)          +  0.1% ±  0.2%
  peak_rss           24.3MB ± 73.4KB    24.2MB … 24.4MB          1 ( 8%)          -  0.0% ±  0.3%
  cpu_cycles         1.77G  ± 2.89M     1.76G  … 1.77G           0 ( 0%)          +  0.1% ±  0.1%
  instructions       3.38G  ±  277      3.38G  … 3.38G           0 ( 0%)          +  1.0% ±  0.0%
  cache_references    277K  ± 8.06K      267K  …  297K           1 ( 8%)          -  1.9% ±  6.3%
  cache_misses        236K  ± 8.79K      210K  …  244K           1 ( 8%)          -  0.3% ±  2.9%
  branch_misses      15.5M  ± 33.4K     15.5M  … 15.5M           0 ( 0%)          +  0.4% ±  0.2%

folkertdev · 2025-05-28T07:59:50Z

zlib-rs/src/deflate/hash_calc.rs

-                state.prev.as_mut_slice()[idx as usize & state.w_mask] = head;
+                state.prev.as_mut_slice()[idx as usize & state.w_mask()] = head;


I see some further improvements for level 2 when moving the state.w_mask() call out of the loop. Did/can you try that?

similarly for above

brian-pane

With the update I just pushed, moving state.w_mask() out of the loop doesn't help much at level 2 on my system, but it produces some improvements at higher compression levels:

Benchmark 1 (68 runs): ./blogpost-compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.5ms ±  776us    72.5ms … 77.5ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 74.0KB    26.5MB … 26.8MB          0 ( 0%)        0%
  cpu_cycles          283M  ±  743K      282M  …  286M           2 ( 3%)        0%
  instructions        544M  ±  285       544M  …  544M           0 ( 0%)        0%
  cache_references    267K  ± 18.3K      261K  …  400K           7 (10%)        0%
  cache_misses        228K  ± 9.18K      202K  …  238K           8 (12%)        0%
  branch_misses      2.91M  ± 6.68K     2.90M  … 2.94M           1 ( 1%)        0%
Benchmark 2 (69 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          72.8ms ±  841us    71.8ms … 78.8ms          2 ( 3%)          -  1.0% ±  0.4%
  peak_rss           26.6MB ± 70.6KB    26.3MB … 26.8MB          1 ( 1%)          -  0.1% ±  0.1%
  cpu_cycles          281M  ± 3.12M      279M  …  306M           2 ( 3%)          -  0.6% ±  0.3%
  instructions        549M  ±  289       549M  …  549M           0 ( 0%)        💩+  1.1% ±  0.0%
  cache_references    265K  ± 5.71K      261K  …  301K           6 ( 9%)          -  0.9% ±  1.7%
  cache_misses        230K  ± 6.91K      197K  …  238K           4 ( 6%)          +  0.7% ±  1.2%
  branch_misses      2.89M  ± 6.34K     2.88M  … 2.91M           0 ( 0%)          -  0.7% ±  0.1%
Benchmark 1 (42 runs): ./blogpost-compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           121ms ± 5.79ms     119ms …  157ms          1 ( 2%)        0%
  peak_rss           24.9MB ± 75.8KB    24.8MB … 25.1MB          0 ( 0%)        0%
  cpu_cycles          493M  ± 23.8M      487M  …  643M           1 ( 2%)        0%
  instructions       1.07G  ±  348      1.07G  … 1.07G           2 ( 5%)        0%
  cache_references    269K  ± 11.6K      264K  …  332K           4 (10%)        0%
  cache_misses        229K  ± 9.71K      200K  …  239K           7 (17%)        0%
  branch_misses      6.19M  ± 8.99K     6.17M  … 6.21M           2 ( 5%)        0%
Benchmark 2 (42 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ± 1.20ms     119ms …  124ms          1 ( 2%)          -  0.8% ±  1.5%
  peak_rss           24.9MB ± 66.1KB    24.8MB … 25.0MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          487M  ± 3.17M      484M  …  503M           4 (10%)          -  1.1% ±  1.5%
  instructions       1.08G  ±  288      1.08G  … 1.08G           0 ( 0%)        💩+  1.0% ±  0.0%
  cache_references    271K  ± 15.9K      264K  …  349K           6 (14%)          +  0.6% ±  2.2%
  cache_misses        230K  ± 10.1K      206K  …  246K           9 (21%)          +  0.7% ±  1.9%
  branch_misses      6.19M  ± 4.53K     6.19M  … 6.21M           0 ( 0%)          +  0.1% ±  0.0%
Benchmark 1 (37 runs): ./blogpost-compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           138ms ± 2.21ms     137ms …  148ms          4 (11%)        0%
  peak_rss           24.7MB ± 83.3KB    24.6MB … 24.8MB          0 ( 0%)        0%
  cpu_cycles          566M  ± 3.40M      564M  …  581M           2 ( 5%)        0%
  instructions       1.40G  ±  301      1.40G  … 1.40G           0 ( 0%)        0%
  cache_references    271K  ± 10.9K      265K  …  314K           4 (11%)        0%
  cache_misses        234K  ± 5.31K      215K  …  242K           2 ( 5%)        0%
  branch_misses      7.04M  ± 4.43K     7.04M  … 7.05M           0 ( 0%)        0%
Benchmark 2 (37 runs): ./target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           137ms ±  867us     136ms …  140ms          2 ( 5%)          -  0.7% ±  0.6%
  peak_rss           24.6MB ± 60.8KB    24.6MB … 24.8MB         11 (30%)          -  0.2% ±  0.1%
  cpu_cycles          563M  ± 2.51M      562M  …  575M           2 ( 5%)          -  0.5% ±  0.2%
  instructions       1.41G  ±  345      1.41G  … 1.41G           0 ( 0%)          +  0.8% ±  0.0%
  cache_references    269K  ± 5.48K      265K  …  291K           2 ( 5%)          -  0.5% ±  1.5%
  cache_misses        232K  ± 7.46K      207K  …  243K           6 (16%)          -  0.8% ±  1.3%
  branch_misses      7.07M  ± 4.28K     7.06M  … 7.08M           0 ( 0%)          +  0.3% ±  0.0%
Benchmark 1 (32 runs): ./blogpost-compress-baseline 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           161ms ±  563us     160ms …  163ms          1 ( 3%)        0%
  peak_rss           24.5MB ± 81.9KB    24.3MB … 24.7MB          1 ( 3%)        0%
  cpu_cycles          667M  ±  791K      666M  …  669M           0 ( 0%)        0%
  instructions       1.50G  ±  339      1.50G  … 1.50G           0 ( 0%)        0%
  cache_references    274K  ± 15.2K      264K  …  348K           2 ( 6%)        0%
  cache_misses        232K  ± 7.54K      207K  …  240K           3 ( 9%)        0%
  branch_misses      7.56M  ± 4.19K     7.56M  … 7.57M           0 ( 0%)        0%
Benchmark 2 (32 runs): ./target/release/examples/blogpost-compress 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           160ms ±  978us     159ms …  163ms          2 ( 6%)          -  0.6% ±  0.2%
  peak_rss           24.5MB ± 58.0KB    24.4MB … 24.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          663M  ± 3.79M      660M  …  677M           3 ( 9%)          -  0.6% ±  0.2%
  instructions       1.51G  ±  363      1.51G  … 1.51G           0 ( 0%)          +  0.8% ±  0.0%
  cache_references    275K  ± 20.3K      264K  …  371K           3 ( 9%)          +  0.6% ±  3.3%
  cache_misses        232K  ± 7.68K      207K  …  240K           4 (13%)          -  0.0% ±  1.6%
  branch_misses      7.57M  ± 6.14K     7.56M  … 7.58M           0 ( 0%)          +  0.1% ±  0.0%
Benchmark 1 (28 runs): ./blogpost-compress-baseline 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           180ms ± 1.84ms     179ms …  189ms          2 ( 7%)        0%
  peak_rss           24.5MB ± 89.1KB    24.3MB … 24.7MB          2 ( 7%)        0%
  cpu_cycles          751M  ± 6.22M      749M  …  782M           5 (18%)        0%
  instructions       1.72G  ±  319      1.72G  … 1.72G           1 ( 4%)        0%
  cache_references    279K  ± 51.0K      265K  …  537K           3 (11%)        0%
  cache_misses        231K  ± 9.08K      205K  …  240K           4 (14%)        0%
  branch_misses      8.25M  ± 7.67K     8.24M  … 8.27M           0 ( 0%)        0%
Benchmark 2 (29 runs): ./target/release/examples/blogpost-compress 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           178ms ± 2.26ms     177ms …  188ms          2 ( 7%)          -  0.8% ±  0.6%
  peak_rss           24.5MB ± 69.1KB    24.3MB … 24.6MB          1 ( 3%)          -  0.1% ±  0.2%
  cpu_cycles          741M  ±  891K      740M  …  743M           0 ( 0%)          -  1.3% ±  0.3%
  instructions       1.73G  ±  334      1.73G  … 1.73G           0 ( 0%)          +  0.7% ±  0.0%
  cache_references    271K  ± 10.7K      265K  …  321K           3 (10%)          -  2.8% ±  7.0%
  cache_misses        233K  ± 9.67K      209K  …  250K           8 (28%)          +  0.8% ±  2.2%
  branch_misses      8.22M  ± 7.87K     8.20M  … 8.24M           3 (10%)          -  0.4% ±  0.1%
Benchmark 1 (23 runs): ./blogpost-compress-baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           221ms ±  882us     219ms …  223ms          0 ( 0%)        0%
  peak_rss           24.5MB ± 97.1KB    24.4MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          926M  ± 1.16M      925M  …  929M           1 ( 4%)        0%
  instructions       1.89G  ±  300      1.89G  … 1.89G           0 ( 0%)        0%
  cache_references    287K  ± 76.4K      266K  …  637K           1 ( 4%)        0%
  cache_misses        234K  ± 6.08K      217K  …  242K           2 ( 9%)        0%
  branch_misses      8.42M  ± 8.76K     8.40M  … 8.43M           0 ( 0%)        0%
Benchmark 2 (23 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           219ms ±  752us     218ms …  221ms          0 ( 0%)          -  1.1% ±  0.2%
  peak_rss           24.5MB ± 74.7KB    24.4MB … 24.7MB          0 ( 0%)          +  0.0% ±  0.2%
  cpu_cycles          915M  ±  785K      914M  …  917M           0 ( 0%)        ⚡-  1.2% ±  0.1%
  instructions       1.90G  ±  699      1.90G  … 1.90G           1 ( 4%)          +  0.6% ±  0.0%
  cache_references    272K  ± 7.03K      266K  …  293K           2 ( 9%)          -  5.1% ± 11.3%
  cache_misses        236K  ± 2.95K      229K  …  242K           0 ( 0%)          +  0.7% ±  1.2%
  branch_misses      8.40M  ± 7.02K     8.39M  … 8.41M           0 ( 0%)          -  0.2% ±  0.1%
Benchmark 1 (17 runs): ./blogpost-compress-baseline 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           306ms ± 1.01ms     305ms …  309ms          1 ( 6%)        0%
  peak_rss           24.4MB ± 83.9KB    24.2MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.28G  ± 1.64M     1.28G  … 1.29G           0 ( 0%)        0%
  instructions       2.28G  ±  328      2.28G  … 2.28G           0 ( 0%)        0%
  cache_references    272K  ± 6.44K      266K  …  295K           4 (24%)        0%
  cache_misses        235K  ± 5.73K      217K  …  239K           2 (12%)        0%
  branch_misses      9.64M  ± 6.96K     9.63M  … 9.65M           0 ( 0%)        0%
Benchmark 2 (17 runs): ./target/release/examples/blogpost-compress 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           301ms ± 4.07ms     299ms …  316ms          1 ( 6%)          -  1.6% ±  0.7%
  peak_rss           24.4MB ± 58.7KB    24.3MB … 24.5MB          0 ( 0%)          -  0.1% ±  0.2%
  cpu_cycles         1.26G  ± 6.00M     1.26G  … 1.28G           2 (12%)        ⚡-  1.9% ±  0.2%
  instructions       2.30G  ±  335      2.30G  … 2.30G           2 (12%)          +  0.6% ±  0.0%
  cache_references    289K  ± 41.7K      267K  …  404K           2 (12%)          +  5.9% ±  7.7%
  cache_misses        238K  ± 6.50K      222K  …  251K           3 (18%)          +  1.2% ±  1.8%
  branch_misses      9.58M  ± 14.8K     9.54M  … 9.60M           0 ( 0%)          -  0.7% ±  0.1%
Benchmark 1 (13 runs): ./blogpost-compress-baseline 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           400ms ± 2.21ms     394ms …  404ms          3 (23%)        0%
  peak_rss           24.4MB ± 76.2KB    24.3MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.69G  ±  954K     1.69G  … 1.69G           0 ( 0%)        0%
  instructions       2.73G  ±  300      2.73G  … 2.73G           0 ( 0%)        0%
  cache_references    273K  ± 5.98K      265K  …  286K           0 ( 0%)        0%
  cache_misses        235K  ± 4.99K      223K  …  242K           2 (15%)        0%
  branch_misses      9.77M  ± 7.98K     9.76M  … 9.79M           0 ( 0%)        0%
Benchmark 2 (13 runs): ./target/release/examples/blogpost-compress 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           394ms ±  503us     393ms …  395ms          0 ( 0%)        ⚡-  1.6% ±  0.3%
  peak_rss           24.4MB ± 71.8KB    24.2MB … 24.5MB          1 ( 8%)          +  0.0% ±  0.2%
  cpu_cycles         1.66G  ± 1.34M     1.66G  … 1.66G           0 ( 0%)        ⚡-  1.7% ±  0.1%
  instructions       2.75G  ±  164      2.75G  … 2.75G           0 ( 0%)          +  0.5% ±  0.0%
  cache_references    277K  ± 16.3K      266K  …  318K           0 ( 0%)          +  1.7% ±  3.6%
  cache_misses        237K  ± 4.41K      229K  …  245K           1 ( 8%)          +  0.7% ±  1.6%
  branch_misses      9.70M  ± 13.0K     9.68M  … 9.73M           0 ( 0%)          -  0.7% ±  0.1%
Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           419ms ±  935us     418ms …  421ms          2 (17%)        0%
  peak_rss           24.4MB ± 49.4KB    24.2MB … 24.4MB          4 (33%)        0%
  cpu_cycles         1.76G  ± 1.51M     1.76G  … 1.77G           0 ( 0%)        0%
  instructions       3.35G  ±  313      3.35G  … 3.35G           1 ( 8%)        0%
  cache_references    279K  ± 24.6K      266K  …  355K           1 ( 8%)        0%
  cache_misses        235K  ± 7.01K      221K  …  242K           1 ( 8%)        0%
  branch_misses      15.4M  ± 17.0K     15.4M  … 15.5M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           422ms ± 1.51ms     420ms …  425ms          0 ( 0%)          +  0.7% ±  0.3%
  peak_rss           24.4MB ± 97.2KB    24.2MB … 24.6MB          0 ( 0%)          +  0.1% ±  0.3%
  cpu_cycles         1.77G  ± 2.27M     1.77G  … 1.78G           0 ( 0%)          +  0.4% ±  0.1%
  instructions       3.38G  ±  375      3.38G  … 3.38G           0 ( 0%)          +  1.0% ±  0.0%
  cache_references    278K  ± 8.24K      268K  …  296K           0 ( 0%)          -  0.1% ±  5.6%
  cache_misses        238K  ± 4.10K      230K  …  245K           0 ( 0%)          +  1.5% ±  2.1%
  branch_misses      15.8M  ± 40.1K     15.7M  … 15.9M           0 ( 0%)        💩+  2.2% ±  0.2%
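
The hoisting discussed in this review thread can be sketched roughly as follows. This is a simplified illustration, not the actual zlib-rs code: the `State` struct, the `Vec<u16>` used for `prev`, and both function names are hypothetical, and the mask is assumed to be `w_size - 1` for a power-of-two window.

```rust
/// Hypothetical, simplified stand-in for the deflate state.
struct State {
    w_size: usize,  // assumed to be a power of two
    prev: Vec<u16>, // one slot per window position
}

impl State {
    #[inline(always)]
    fn w_mask(&self) -> usize {
        self.w_size - 1
    }
}

/// Store `head` for each index, recomputing the mask on every iteration.
fn insert_in_loop(state: &mut State, indices: &[u16], head: u16) {
    for &idx in indices {
        let mask = state.w_mask();
        state.prev[idx as usize & mask] = head;
    }
}

/// Same stores, with the mask computed once before the loop.
fn insert_hoisted(state: &mut State, indices: &[u16], head: u16) {
    let mask = state.w_mask();
    for &idx in indices {
        state.prev[idx as usize & mask] = head;
    }
}

fn main() {
    let indices = [1u16, 9, 15];
    let mut a = State { w_size: 8, prev: vec![0; 8] };
    let mut b = State { w_size: 8, prev: vec![0; 8] };
    insert_in_loop(&mut a, &indices, 7);
    insert_hoisted(&mut b, &indices, 7);
    // Both variants must produce identical tables.
    assert_eq!(a.prev, b.prev);
    println!("{:?}", b.prev);
}
```

An optimizing compiler may perform this hoisting on its own when it can prove `w_size` does not change inside the loop, so any gain from doing it by hand is best confirmed by measurement, as in the benchmarks above.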

folkertdev

Looks good, thanks!

folkertdev reviewed May 28, 2025

Cache locality improvement for deflate State.lookahead

d205112

brian-pane force-pushed the cache-layout branch from 8e3f9e6 to d205112 Compare May 28, 2025 14:15

folkertdev approved these changes May 28, 2025

folkertdev merged commit f10deb4 into trifectatechfoundation:main May 28, 2025
24 checks passed

brian-pane deleted the cache-layout branch May 28, 2025 21:29

BrewTestBot mentioned this pull request Jun 6, 2025

zlib-rs 0.5.1 Homebrew/homebrew-core#225936

Merged
