Optimize out some read operations from the fast deflate algorithm #375

brian-pane · 2025-05-29T16:24:22Z

No description provided.

codecov · 2025-05-29T16:25:44Z

Codecov Report

Attention: Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zlib-rs/src/deflate/algorithm/fast.rs	97.87%	1 Missing ⚠️

Flag	Coverage Δ
fuzz-compress	`?`
fuzz-decompress	`?`
test-aarch64-apple-darwin	`93.36% <97.87%> (-0.02%)`	⬇️
test-x86_64-apple-darwin	`91.70% <97.87%> (+<0.01%)`	⬆️
test-x86_64-unknown-linux-gnu	`90.43% <97.87%> (-0.03%)`	⬇️

Files with missing lines	Coverage Δ
zlib-rs/src/deflate/hash_calc.rs	`100.00% <ø> (ø)`
zlib-rs/src/deflate/algorithm/fast.rs	`96.51% <97.87%> (-1.05%)`	⬇️

... and 4 files with indirect coverage changes

brian-pane · 2025-05-29T16:38:12Z

Notes:

This includes a mild restructuring of the loop logic inside deflate_fast, which I did to make the logic easier to reason about while making changes. I think the result may also be easier for the compiler to optimize, especially on targets with limited registers, now that local values like match_len aren't maintained across loop iterations.
I had to add an additional #[inline] in hash_calc.rs to keep this from causing a regression at compression level 1.
It might be possible to add an additional speedup in the future: if lookahead is >= 8, fetch the next 8 bytes instead of just 4. We could then use the first 4 of those bytes for the quick_insert_value call and then pass all 8 to longest_match to save another read from the window.
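A minimal sketch of the 8-byte idea from the last note, assuming a little-endian interpretation of the window bytes. The helper names `read_eight` and `first_four` are hypothetical and are not part of zlib-rs; the real `quick_insert_value` and `longest_match` signatures are not shown here.

```rust
/// Hypothetical helper: read 8 bytes starting at `pos` as a little-endian u64.
/// Returns None when fewer than 8 bytes are available (lookahead < 8).
fn read_eight(window: &[u8], pos: usize) -> Option<u64> {
    let end = pos.checked_add(8)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(window.get(pos..end)?);
    Some(u64::from_le_bytes(bytes))
}

/// With a little-endian load, the low 32 bits hold the first four bytes,
/// so the 4-byte value for hashing needs no second read from the window.
fn first_four(eight: u64) -> u32 {
    eight as u32
}
```

Under these assumptions, a caller would read once, pass `first_four(v)` to the hash insertion step, and hand the full `v` to the match search, falling back to a 4-byte read when `read_eight` returns `None`.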

brian-pane · 2025-05-29T16:44:48Z

Benchmark 1 (72 runs): ./blogpost-compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          70.0ms ± 1.29ms    68.8ms … 74.4ms          6 ( 8%)        0%
  peak_rss           26.6MB ± 66.4KB    26.4MB … 26.8MB          1 ( 1%)        0%
  cpu_cycles          267M  ±  921K      265M  …  271M           3 ( 4%)        0%
  instructions        524M  ±  277       524M  …  524M           0 ( 0%)        0%
  cache_references    264K  ± 5.95K      261K  …  302K           7 (10%)        0%
  cache_misses        229K  ± 9.91K      198K  …  246K          11 (15%)        0%
  branch_misses      2.85M  ± 3.78K     2.84M  … 2.85M           0 ( 0%)        0%
Benchmark 2 (73 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          69.4ms ± 1.09ms    68.1ms … 75.2ms          5 ( 7%)          -  0.9% ±  0.6%
  peak_rss           26.7MB ± 77.2KB    26.5MB … 26.8MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          265M  ± 1.09M      263M  …  270M           4 ( 5%)          -  0.8% ±  0.1%
  instructions        509M  ±  324       509M  …  509M           0 ( 0%)        ⚡-  2.8% ±  0.0%
  cache_references    264K  ± 3.35K      260K  …  281K           8 (11%)          -  0.2% ±  0.6%
  cache_misses        229K  ± 10.8K      193K  …  244K          11 (15%)          -  0.3% ±  1.5%
  branch_misses      2.92M  ± 4.85K     2.91M  … 2.93M           0 ( 0%)        💩+  2.6% ±  0.0%
Benchmark 1 (42 runs): ./blogpost-compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ±  791us     119ms …  122ms          0 ( 0%)        0%
  peak_rss           24.9MB ± 73.0KB    24.8MB … 25.1MB          0 ( 0%)        0%
  cpu_cycles          487M  ± 1.47M      483M  …  491M           1 ( 2%)        0%
  instructions       1.07G  ±  317      1.07G  … 1.07G           1 ( 2%)        0%
  cache_references    267K  ± 2.86K      263K  …  276K           1 ( 2%)        0%
  cache_misses        234K  ± 3.56K      227K  …  243K           0 ( 0%)        0%
  branch_misses      6.19M  ± 6.27K     6.17M  … 6.21M           3 ( 7%)        0%
Benchmark 2 (43 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           117ms ±  813us     116ms …  120ms          1 ( 2%)        ⚡-  2.4% ±  0.3%
  peak_rss           25.0MB ± 77.6KB    24.8MB … 25.1MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          473M  ± 1.41M      471M  …  479M           2 ( 5%)        ⚡-  2.7% ±  0.1%
  instructions       1.05G  ±  453      1.05G  … 1.05G           1 ( 2%)        ⚡-  2.1% ±  0.0%
  cache_references    272K  ± 14.3K      264K  …  325K           6 (14%)          +  2.1% ±  1.7%
  cache_misses        233K  ± 8.68K      204K  …  243K           3 ( 7%)          -  0.4% ±  1.2%
  branch_misses      6.19M  ± 4.89K     6.18M  … 6.20M           1 ( 2%)          +  0.0% ±  0.0%
Benchmark 1 (37 runs): ./blogpost-compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           137ms ± 4.45ms     135ms …  162ms          2 ( 5%)        0%
  peak_rss           24.7MB ± 62.1KB    24.6MB … 24.8MB          0 ( 0%)        0%
  cpu_cycles          563M  ± 19.0M      558M  …  673M           4 (11%)        0%
  instructions       1.40G  ±  266      1.40G  … 1.40G           0 ( 0%)        0%
  cache_references    271K  ± 7.42K      264K  …  305K           2 ( 5%)        0%
  cache_misses        233K  ± 6.48K      203K  …  245K           3 ( 8%)        0%
  branch_misses      7.04M  ± 3.99K     7.03M  … 7.05M           2 ( 5%)        0%
Benchmark 2 (37 runs): ./target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           136ms ± 1.64ms     135ms …  143ms          2 ( 5%)          -  0.5% ±  1.1%
  peak_rss           24.7MB ± 87.5KB    24.6MB … 24.9MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          559M  ± 4.02M      557M  …  581M           3 ( 8%)          -  0.8% ±  1.1%
  instructions       1.39G  ±  322      1.39G  … 1.39G           1 ( 3%)        ⚡-  1.2% ±  0.0%
  cache_references    270K  ± 9.24K      262K  …  314K           3 ( 8%)          -  0.1% ±  1.4%
  cache_misses        233K  ± 6.52K      210K  …  243K           2 ( 5%)          +  0.0% ±  1.3%
  branch_misses      7.06M  ± 4.64K     7.05M  … 7.07M           1 ( 3%)          +  0.3% ±  0.0%
Benchmark 1 (32 runs): ./blogpost-compress-baseline 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           160ms ± 3.08ms     158ms …  175ms          2 ( 6%)        0%
  peak_rss           24.5MB ± 72.7KB    24.4MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          661M  ± 4.21M      658M  …  681M           3 ( 9%)        0%
  instructions       1.50G  ±  450      1.50G  … 1.50G           2 ( 6%)        0%
  cache_references    279K  ± 23.1K      265K  …  382K           4 (13%)        0%
  cache_misses        233K  ± 6.84K      212K  …  250K           4 (13%)        0%
  branch_misses      7.55M  ± 5.39K     7.54M  … 7.57M           3 ( 9%)        0%
Benchmark 2 (32 runs): ./target/release/examples/blogpost-compress 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           159ms ±  907us     158ms …  162ms          2 ( 6%)          -  0.7% ±  0.7%
  peak_rss           24.6MB ± 74.7KB    24.5MB … 24.7MB          0 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles          658M  ±  877K      656M  …  660M           0 ( 0%)          -  0.5% ±  0.2%
  instructions       1.48G  ±  314      1.48G  … 1.48G           0 ( 0%)        ⚡-  1.2% ±  0.0%
  cache_references    271K  ± 7.86K      264K  …  302K           1 ( 3%)          -  2.9% ±  3.1%
  cache_misses        234K  ± 7.12K      207K  …  244K           2 ( 6%)          +  0.5% ±  1.5%
  branch_misses      7.62M  ± 6.15K     7.60M  … 7.63M           0 ( 0%)          +  0.9% ±  0.0%
Benchmark 1 (29 runs): ./blogpost-compress-baseline 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           176ms ± 1.01ms     175ms …  179ms          4 (14%)        0%
  peak_rss           24.5MB ± 84.7KB    24.3MB … 24.7MB          1 ( 3%)        0%
  cpu_cycles          733M  ± 2.45M      731M  …  744M           1 ( 3%)        0%
  instructions       1.73G  ±  307      1.73G  … 1.73G           0 ( 0%)        0%
  cache_references    270K  ± 3.94K      265K  …  283K           1 ( 3%)        0%
  cache_misses        233K  ± 6.81K      206K  …  242K           5 (17%)        0%
  branch_misses      8.28M  ± 32.3K     8.22M  … 8.35M           0 ( 0%)        0%
Benchmark 2 (29 runs): ./target/release/examples/blogpost-compress 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175ms ±  550us     175ms …  177ms          1 ( 3%)          -  0.6% ±  0.2%
  peak_rss           24.6MB ± 84.1KB    24.5MB … 24.7MB          0 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles          731M  ±  578K      731M  …  733M           1 ( 3%)          -  0.2% ±  0.1%
  instructions       1.70G  ±  336      1.70G  … 1.70G           0 ( 0%)        ⚡-  1.6% ±  0.0%
  cache_references    268K  ± 2.88K      264K  …  276K           0 ( 0%)          -  0.8% ±  0.7%
  cache_misses        231K  ± 8.21K      203K  …  238K           4 (14%)          -  0.9% ±  1.7%
  branch_misses      8.25M  ± 15.6K     8.22M  … 8.31M           2 ( 7%)          -  0.4% ±  0.2%
Benchmark 1 (24 runs): ./blogpost-compress-baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           217ms ±  746us     215ms …  219ms          1 ( 4%)        0%
  peak_rss           24.5MB ± 79.7KB    24.3MB … 24.7MB          1 ( 4%)        0%
  cpu_cycles          907M  ± 1.35M      906M  …  910M           0 ( 0%)        0%
  instructions       1.90G  ±  356      1.90G  … 1.90G           2 ( 8%)        0%
  cache_references    275K  ± 16.9K      265K  …  346K           4 (17%)        0%
  cache_misses        233K  ± 7.84K      206K  …  238K           2 ( 8%)        0%
  branch_misses      8.44M  ± 30.0K     8.39M  … 8.50M           0 ( 0%)        0%
Benchmark 2 (24 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           217ms ± 1.52ms     215ms …  221ms          3 (13%)          -  0.0% ±  0.3%
  peak_rss           24.6MB ± 67.8KB    24.5MB … 24.7MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          905M  ±  516K      904M  …  906M           0 ( 0%)          -  0.2% ±  0.1%
  instructions       1.87G  ±  353      1.87G  … 1.87G           0 ( 0%)        ⚡-  1.5% ±  0.0%
  cache_references    272K  ± 11.3K      265K  …  315K           3 (13%)          -  1.0% ±  3.1%
  cache_misses        232K  ± 9.62K      207K  …  245K           4 (17%)          -  0.4% ±  2.2%
  branch_misses      8.39M  ± 7.52K     8.38M  … 8.41M           0 ( 0%)          -  0.5% ±  0.2%
Benchmark 1 (17 runs): ./blogpost-compress-baseline 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           300ms ± 2.76ms     295ms …  309ms          2 (12%)        0%
  peak_rss           24.4MB ± 58.2KB    24.3MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.26G  ± 6.34M     1.26G  … 1.28G           1 ( 6%)        0%
  instructions       2.30G  ±  279      2.30G  … 2.30G           0 ( 0%)        0%
  cache_references    273K  ± 8.60K      266K  …  295K           3 (18%)        0%
  cache_misses        233K  ± 8.21K      211K  …  245K           4 (24%)        0%
  branch_misses      9.56M  ± 7.53K     9.55M  … 9.57M           0 ( 0%)        0%
Benchmark 2 (17 runs): ./target/release/examples/blogpost-compress 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           298ms ± 1.47ms     294ms …  301ms          1 ( 6%)          -  0.7% ±  0.5%
  peak_rss           24.5MB ± 77.4KB    24.4MB … 24.6MB          0 ( 0%)          +  0.3% ±  0.2%
  cpu_cycles         1.25G  ± 3.88M     1.25G  … 1.27G           2 (12%)          -  0.4% ±  0.3%
  instructions       2.27G  ±  291      2.27G  … 2.27G           0 ( 0%)        ⚡-  1.1% ±  0.0%
  cache_references    269K  ± 4.03K      264K  …  282K           1 ( 6%)          -  1.5% ±  1.7%
  cache_misses        233K  ± 7.61K      212K  …  240K           2 (12%)          -  0.3% ±  2.4%
  branch_misses      9.53M  ± 10.9K     9.50M  … 9.55M           1 ( 6%)          -  0.3% ±  0.1%
Benchmark 1 (13 runs): ./blogpost-compress-baseline 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           393ms ± 1.98ms     387ms …  396ms          1 ( 8%)        0%
  peak_rss           24.4MB ± 70.9KB    24.3MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.66G  ± 1.07M     1.66G  … 1.66G           0 ( 0%)        0%
  instructions       2.75G  ±  261      2.75G  … 2.75G           0 ( 0%)        0%
  cache_references    274K  ± 12.7K      267K  …  315K           1 ( 8%)        0%
  cache_misses        232K  ± 8.00K      214K  …  240K           2 (15%)        0%
  branch_misses      9.69M  ± 10.7K     9.67M  … 9.70M           0 ( 0%)        0%
Benchmark 2 (13 runs): ./target/release/examples/blogpost-compress 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           392ms ± 2.11ms     386ms …  394ms          1 ( 8%)          -  0.3% ±  0.4%
  peak_rss           24.5MB ± 89.1KB    24.4MB … 24.6MB          0 ( 0%)          +  0.1% ±  0.3%
  cpu_cycles         1.66G  ± 1.36M     1.65G  … 1.66G           0 ( 0%)          -  0.3% ±  0.1%
  instructions       2.72G  ±  427      2.72G  … 2.72G           2 (15%)          -  1.0% ±  0.0%
  cache_references    270K  ± 5.15K      265K  …  281K           0 ( 0%)          -  1.4% ±  2.9%
  cache_misses        237K  ± 1.81K      233K  …  240K           1 ( 8%)          +  2.1% ±  2.0%
  branch_misses      9.66M  ± 11.8K     9.64M  … 9.68M           0 ( 0%)          -  0.3% ±  0.1%
Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           420ms ± 1.90ms     416ms …  425ms          2 (17%)        0%
  peak_rss           24.4MB ± 68.2KB    24.2MB … 24.5MB          4 (33%)        0%
  cpu_cycles         1.77G  ± 2.77M     1.77G  … 1.78G           0 ( 0%)        0%
  instructions       3.38G  ±  319      3.38G  … 3.38G           0 ( 0%)        0%
  cache_references    274K  ± 10.4K      265K  …  299K           0 ( 0%)        0%
  cache_misses        228K  ± 9.65K      216K  …  241K           0 ( 0%)        0%
  branch_misses      15.8M  ± 27.0K     15.8M  … 15.8M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           421ms ± 1.03ms     420ms …  424ms          1 ( 8%)          +  0.2% ±  0.3%
  peak_rss           24.4MB ± 94.8KB    24.2MB … 24.6MB          0 ( 0%)          +  0.2% ±  0.3%
  cpu_cycles         1.78G  ± 4.21M     1.77G  … 1.79G           1 ( 8%)          +  0.2% ±  0.2%
  instructions       3.41G  ±  190      3.41G  … 3.41G           0 ( 0%)          +  0.9% ±  0.0%
  cache_references    282K  ± 28.6K      265K  …  366K           1 ( 8%)          +  2.8% ±  6.6%
  cache_misses        235K  ± 3.25K      231K  …  240K           0 ( 0%)          +  3.1% ±  2.7%
  branch_misses      15.5M  ± 24.4K     15.5M  … 15.6M           3 (25%)        ⚡-  1.7% ±  0.1%

folkertdev · 2025-05-29T17:25:10Z

zlib-rs/src/deflate/algorithm/fast.rs

-            /* Find the longest match, discarding those <= prev_length.
-             * At this point we have always match length < WANT_MIN_MATCH
-             */
+            // Find the longest match for the string starting at offset state.strstart.


Did we lose the property that "At this point we have always match length < WANT_MIN_MATCH"?

I'd assume not, in which case I'd like to keep that bit of the comment.

In the new version, match_length doesn't even exist until after that point.

folkertdev · 2025-05-29T17:26:10Z

zlib-rs/src/deflate/algorithm/fast.rs

-    let mut bflush; /* set if current block must be flushed */
-    let mut dist;
-    let mut match_len = 0;


I think these variables were historically pre-declared as an optimization. But I do buy that this is no longer needed, and in fact a more restricted scope might help with reusing registers/stack slots.
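To illustrate the scoping point, here is a toy run-length tokenizer, not the actual deflate_fast loop. The names `Token`, `tokenize`, and `MIN_RUN` are made up for this sketch. The per-iteration value `run_len` is declared inside the loop body and checked right below its definition, so only `pos` is carried across iterations.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Literal(u8),
    Run { byte: u8, len: usize },
}

/// Toy tokenizer: emits a Run for 3 or more repeated bytes, else a Literal.
fn tokenize(input: &[u8]) -> Vec<Token> {
    const MIN_RUN: usize = 3; // hypothetical threshold for this sketch
    let mut out = Vec::new();
    let mut pos = 0; // the only state that lives across iterations
    while pos < input.len() {
        let byte = input[pos];
        // Scoped to this iteration: computed fresh, never reset or reused.
        let run_len = input[pos..].iter().take_while(|&&b| b == byte).count();
        if run_len >= MIN_RUN {
            out.push(Token::Run { byte, len: run_len });
            pos += run_len;
        } else {
            out.push(Token::Literal(byte));
            pos += 1;
        }
    }
    out
}
```

Whether narrower scopes actually help register allocation depends on the compiler and target, so a change like this is worth confirming with benchmarks.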

folkertdev

Excellent work. These are my local numbers:

Benchmark 2 (41 runs): target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           125ms ± 1.51ms     123ms …  132ms          1 ( 2%)        ⚡-  2.9% ±  0.6%
  peak_rss           24.9MB ± 73.2KB    24.8MB … 25.0MB          0 ( 0%)          -  0.0% ±  0.2%
  cpu_cycles          500M  ± 5.04M      493M  …  523M           2 ( 5%)        ⚡-  3.6% ±  0.6%
  instructions       1.08G  ±  379      1.08G  … 1.08G           0 ( 0%)          -  0.6% ±  0.0%
  cache_references   33.7M  ±  339K     33.3M  … 34.8M           1 ( 2%)          +  0.2% ±  0.5%
  cache_misses       1.14M  ±  181K      794K  … 1.79M           1 ( 2%)          +  8.9% ± 10.5%
  branch_misses      6.26M  ± 3.11K     6.26M  … 6.27M           2 ( 5%)          -  0.1% ±  0.0%

All of the other levels show no significant changes.

folkertdev · 2025-05-29T17:39:02Z

zlib-rs/src/deflate/algorithm/fast.rs

-            /* Find the longest match, discarding those <= prev_length.
-             * At this point we have always match length < WANT_MIN_MATCH
-             */
+            // Find the longest match for the string starting at offset state.strstart.


ah, the diff made that hard to see. Now the check for the length is right below the definition, so that will do.

Optimize out some read operations from the fast deflate algorithm

5976106

folkertdev reviewed May 29, 2025

folkertdev approved these changes May 29, 2025

folkertdev merged commit 7fafed0 into trifectatechfoundation:main May 29, 2025
24 checks passed

brian-pane deleted the reuse-read branch May 29, 2025 17:58

BrewTestBot mentioned this pull request Jun 6, 2025

zlib-rs 0.5.1 Homebrew/homebrew-core#225936

Merged
