Optimize out some read operations from the fast deflate algorithm #375

Merged 1 commit on May 29, 2025

Conversation

brian-pane

No description provided.

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.

Files with missing lines                  Patch %   Lines
zlib-rs/src/deflate/algorithm/fast.rs     97.87%    1 Missing ⚠️

Flag                            Coverage Δ
fuzz-compress                   ?
fuzz-decompress                 ?
test-aarch64-apple-darwin       93.36% <97.87%> (-0.02%) ⬇️
test-x86_64-apple-darwin        91.70% <97.87%> (+<0.01%) ⬆️
test-x86_64-unknown-linux-gnu   90.43% <97.87%> (-0.03%) ⬇️

Files with missing lines                  Coverage Δ
zlib-rs/src/deflate/hash_calc.rs          100.00% <ø> (ø)
zlib-rs/src/deflate/algorithm/fast.rs     96.51% <97.87%> (-1.05%) ⬇️

... and 4 files with indirect coverage changes

@brian-pane
Author

Notes:

  • This includes a mild restructuring of the loop logic inside deflate_fast, which I did to make the logic easier to reason about while making changes. I think the result might also be easier for the compiler to optimize, especially on targets with few registers, now that local values like match_len aren't maintained across loop iterations.
  • I had to add an #[inline] in hash_calc.rs to keep this change from causing a regression at compression level 1.
  • It might be possible to add another speedup in the future: if lookahead is >= 8, fetch the next 8 bytes instead of just 4. We could then use the first 4 of those bytes for the quick_insert_value call and pass all 8 to longest_match, saving another read from the window.
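
A rough sketch of the 8-byte idea from the last note, not the code in this PR. The helpers `read_u64_le`, `hash4`, and `common_prefix_len` are hypothetical names made up for illustration, and the hash is an arbitrary multiplicative one rather than the hash zlib-rs actually uses. The point is only that a single little-endian 8-byte load can supply both the 4-byte value to hash (its low 32 bits) and the first 8 bytes of the match comparison.

```rust
// Hypothetical sketch: none of these helpers are zlib-rs APIs.

/// Read 8 bytes at `pos` as a little-endian u64, or None if fewer than 8 remain.
fn read_u64_le(window: &[u8], pos: usize) -> Option<u64> {
    let bytes = window.get(pos..pos.checked_add(8)?)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

/// Illustrative 4-byte multiplicative hash (an assumption, not zlib-rs's hash).
fn hash4(value: u32) -> u32 {
    value.wrapping_mul(2654435761) >> 16
}

/// Number of leading bytes (0..=8) on which two little-endian 8-byte loads agree.
fn common_prefix_len(a: u64, b: u64) -> u32 {
    let diff = a ^ b;
    if diff == 0 {
        8
    } else {
        // In a little-endian load the first byte sits in the lowest bits,
        // so the lowest differing bit identifies the first differing byte.
        diff.trailing_zeros() / 8
    }
}

fn main() {
    let window = b"abcdefgh____abcdefXY";
    let strstart = 12; // current position
    let candidate = 0; // earlier position proposed by the hash table

    // One 8-byte load at strstart...
    let cur = read_u64_le(window, strstart).expect("lookahead >= 8");
    // ...whose low 4 bytes feed the hash...
    let h = hash4(cur as u32);
    // ...and whose full 8 bytes seed the match comparison.
    let cand = read_u64_le(window, candidate).expect("candidate in window");
    let len = common_prefix_len(cur, cand);
    println!("hash={h:#x} prefix_len={len}");
}
```

When fewer than 8 bytes of lookahead remain, `read_u64_le` returns `None`, which is where a real implementation would presumably fall back to the existing 4-byte path.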

@brian-pane
Author

Benchmark 1 (72 runs): ./blogpost-compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          70.0ms ± 1.29ms    68.8ms … 74.4ms          6 ( 8%)        0%
  peak_rss           26.6MB ± 66.4KB    26.4MB … 26.8MB          1 ( 1%)        0%
  cpu_cycles          267M  ±  921K      265M  …  271M           3 ( 4%)        0%
  instructions        524M  ±  277       524M  …  524M           0 ( 0%)        0%
  cache_references    264K  ± 5.95K      261K  …  302K           7 (10%)        0%
  cache_misses        229K  ± 9.91K      198K  …  246K          11 (15%)        0%
  branch_misses      2.85M  ± 3.78K     2.84M  … 2.85M           0 ( 0%)        0%
Benchmark 2 (73 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          69.4ms ± 1.09ms    68.1ms … 75.2ms          5 ( 7%)          -  0.9% ±  0.6%
  peak_rss           26.7MB ± 77.2KB    26.5MB … 26.8MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          265M  ± 1.09M      263M  …  270M           4 ( 5%)          -  0.8% ±  0.1%
  instructions        509M  ±  324       509M  …  509M           0 ( 0%)        ⚡-  2.8% ±  0.0%
  cache_references    264K  ± 3.35K      260K  …  281K           8 (11%)          -  0.2% ±  0.6%
  cache_misses        229K  ± 10.8K      193K  …  244K          11 (15%)          -  0.3% ±  1.5%
  branch_misses      2.92M  ± 4.85K     2.91M  … 2.93M           0 ( 0%)        💩+  2.6% ±  0.0%
Benchmark 1 (42 runs): ./blogpost-compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           120ms ±  791us     119ms …  122ms          0 ( 0%)        0%
  peak_rss           24.9MB ± 73.0KB    24.8MB … 25.1MB          0 ( 0%)        0%
  cpu_cycles          487M  ± 1.47M      483M  …  491M           1 ( 2%)        0%
  instructions       1.07G  ±  317      1.07G  … 1.07G           1 ( 2%)        0%
  cache_references    267K  ± 2.86K      263K  …  276K           1 ( 2%)        0%
  cache_misses        234K  ± 3.56K      227K  …  243K           0 ( 0%)        0%
  branch_misses      6.19M  ± 6.27K     6.17M  … 6.21M           3 ( 7%)        0%
Benchmark 2 (43 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           117ms ±  813us     116ms …  120ms          1 ( 2%)        ⚡-  2.4% ±  0.3%
  peak_rss           25.0MB ± 77.6KB    24.8MB … 25.1MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          473M  ± 1.41M      471M  …  479M           2 ( 5%)        ⚡-  2.7% ±  0.1%
  instructions       1.05G  ±  453      1.05G  … 1.05G           1 ( 2%)        ⚡-  2.1% ±  0.0%
  cache_references    272K  ± 14.3K      264K  …  325K           6 (14%)          +  2.1% ±  1.7%
  cache_misses        233K  ± 8.68K      204K  …  243K           3 ( 7%)          -  0.4% ±  1.2%
  branch_misses      6.19M  ± 4.89K     6.18M  … 6.20M           1 ( 2%)          +  0.0% ±  0.0%
Benchmark 1 (37 runs): ./blogpost-compress-baseline 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           137ms ± 4.45ms     135ms …  162ms          2 ( 5%)        0%
  peak_rss           24.7MB ± 62.1KB    24.6MB … 24.8MB          0 ( 0%)        0%
  cpu_cycles          563M  ± 19.0M      558M  …  673M           4 (11%)        0%
  instructions       1.40G  ±  266      1.40G  … 1.40G           0 ( 0%)        0%
  cache_references    271K  ± 7.42K      264K  …  305K           2 ( 5%)        0%
  cache_misses        233K  ± 6.48K      203K  …  245K           3 ( 8%)        0%
  branch_misses      7.04M  ± 3.99K     7.03M  … 7.05M           2 ( 5%)        0%
Benchmark 2 (37 runs): ./target/release/examples/blogpost-compress 3 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           136ms ± 1.64ms     135ms …  143ms          2 ( 5%)          -  0.5% ±  1.1%
  peak_rss           24.7MB ± 87.5KB    24.6MB … 24.9MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          559M  ± 4.02M      557M  …  581M           3 ( 8%)          -  0.8% ±  1.1%
  instructions       1.39G  ±  322      1.39G  … 1.39G           1 ( 3%)        ⚡-  1.2% ±  0.0%
  cache_references    270K  ± 9.24K      262K  …  314K           3 ( 8%)          -  0.1% ±  1.4%
  cache_misses        233K  ± 6.52K      210K  …  243K           2 ( 5%)          +  0.0% ±  1.3%
  branch_misses      7.06M  ± 4.64K     7.05M  … 7.07M           1 ( 3%)          +  0.3% ±  0.0%
Benchmark 1 (32 runs): ./blogpost-compress-baseline 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           160ms ± 3.08ms     158ms …  175ms          2 ( 6%)        0%
  peak_rss           24.5MB ± 72.7KB    24.4MB … 24.7MB          0 ( 0%)        0%
  cpu_cycles          661M  ± 4.21M      658M  …  681M           3 ( 9%)        0%
  instructions       1.50G  ±  450      1.50G  … 1.50G           2 ( 6%)        0%
  cache_references    279K  ± 23.1K      265K  …  382K           4 (13%)        0%
  cache_misses        233K  ± 6.84K      212K  …  250K           4 (13%)        0%
  branch_misses      7.55M  ± 5.39K     7.54M  … 7.57M           3 ( 9%)        0%
Benchmark 2 (32 runs): ./target/release/examples/blogpost-compress 4 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           159ms ±  907us     158ms …  162ms          2 ( 6%)          -  0.7% ±  0.7%
  peak_rss           24.6MB ± 74.7KB    24.5MB … 24.7MB          0 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles          658M  ±  877K      656M  …  660M           0 ( 0%)          -  0.5% ±  0.2%
  instructions       1.48G  ±  314      1.48G  … 1.48G           0 ( 0%)        ⚡-  1.2% ±  0.0%
  cache_references    271K  ± 7.86K      264K  …  302K           1 ( 3%)          -  2.9% ±  3.1%
  cache_misses        234K  ± 7.12K      207K  …  244K           2 ( 6%)          +  0.5% ±  1.5%
  branch_misses      7.62M  ± 6.15K     7.60M  … 7.63M           0 ( 0%)          +  0.9% ±  0.0%
Benchmark 1 (29 runs): ./blogpost-compress-baseline 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           176ms ± 1.01ms     175ms …  179ms          4 (14%)        0%
  peak_rss           24.5MB ± 84.7KB    24.3MB … 24.7MB          1 ( 3%)        0%
  cpu_cycles          733M  ± 2.45M      731M  …  744M           1 ( 3%)        0%
  instructions       1.73G  ±  307      1.73G  … 1.73G           0 ( 0%)        0%
  cache_references    270K  ± 3.94K      265K  …  283K           1 ( 3%)        0%
  cache_misses        233K  ± 6.81K      206K  …  242K           5 (17%)        0%
  branch_misses      8.28M  ± 32.3K     8.22M  … 8.35M           0 ( 0%)        0%
Benchmark 2 (29 runs): ./target/release/examples/blogpost-compress 5 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175ms ±  550us     175ms …  177ms          1 ( 3%)          -  0.6% ±  0.2%
  peak_rss           24.6MB ± 84.1KB    24.5MB … 24.7MB          0 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles          731M  ±  578K      731M  …  733M           1 ( 3%)          -  0.2% ±  0.1%
  instructions       1.70G  ±  336      1.70G  … 1.70G           0 ( 0%)        ⚡-  1.6% ±  0.0%
  cache_references    268K  ± 2.88K      264K  …  276K           0 ( 0%)          -  0.8% ±  0.7%
  cache_misses        231K  ± 8.21K      203K  …  238K           4 (14%)          -  0.9% ±  1.7%
  branch_misses      8.25M  ± 15.6K     8.22M  … 8.31M           2 ( 7%)          -  0.4% ±  0.2%
Benchmark 1 (24 runs): ./blogpost-compress-baseline 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           217ms ±  746us     215ms …  219ms          1 ( 4%)        0%
  peak_rss           24.5MB ± 79.7KB    24.3MB … 24.7MB          1 ( 4%)        0%
  cpu_cycles          907M  ± 1.35M      906M  …  910M           0 ( 0%)        0%
  instructions       1.90G  ±  356      1.90G  … 1.90G           2 ( 8%)        0%
  cache_references    275K  ± 16.9K      265K  …  346K           4 (17%)        0%
  cache_misses        233K  ± 7.84K      206K  …  238K           2 ( 8%)        0%
  branch_misses      8.44M  ± 30.0K     8.39M  … 8.50M           0 ( 0%)        0%
Benchmark 2 (24 runs): ./target/release/examples/blogpost-compress 6 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           217ms ± 1.52ms     215ms …  221ms          3 (13%)          -  0.0% ±  0.3%
  peak_rss           24.6MB ± 67.8KB    24.5MB … 24.7MB          0 ( 0%)          +  0.1% ±  0.2%
  cpu_cycles          905M  ±  516K      904M  …  906M           0 ( 0%)          -  0.2% ±  0.1%
  instructions       1.87G  ±  353      1.87G  … 1.87G           0 ( 0%)        ⚡-  1.5% ±  0.0%
  cache_references    272K  ± 11.3K      265K  …  315K           3 (13%)          -  1.0% ±  3.1%
  cache_misses        232K  ± 9.62K      207K  …  245K           4 (17%)          -  0.4% ±  2.2%
  branch_misses      8.39M  ± 7.52K     8.38M  … 8.41M           0 ( 0%)          -  0.5% ±  0.2%
Benchmark 1 (17 runs): ./blogpost-compress-baseline 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           300ms ± 2.76ms     295ms …  309ms          2 (12%)        0%
  peak_rss           24.4MB ± 58.2KB    24.3MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.26G  ± 6.34M     1.26G  … 1.28G           1 ( 6%)        0%
  instructions       2.30G  ±  279      2.30G  … 2.30G           0 ( 0%)        0%
  cache_references    273K  ± 8.60K      266K  …  295K           3 (18%)        0%
  cache_misses        233K  ± 8.21K      211K  …  245K           4 (24%)        0%
  branch_misses      9.56M  ± 7.53K     9.55M  … 9.57M           0 ( 0%)        0%
Benchmark 2 (17 runs): ./target/release/examples/blogpost-compress 7 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           298ms ± 1.47ms     294ms …  301ms          1 ( 6%)          -  0.7% ±  0.5%
  peak_rss           24.5MB ± 77.4KB    24.4MB … 24.6MB          0 ( 0%)          +  0.3% ±  0.2%
  cpu_cycles         1.25G  ± 3.88M     1.25G  … 1.27G           2 (12%)          -  0.4% ±  0.3%
  instructions       2.27G  ±  291      2.27G  … 2.27G           0 ( 0%)        ⚡-  1.1% ±  0.0%
  cache_references    269K  ± 4.03K      264K  …  282K           1 ( 6%)          -  1.5% ±  1.7%
  cache_misses        233K  ± 7.61K      212K  …  240K           2 (12%)          -  0.3% ±  2.4%
  branch_misses      9.53M  ± 10.9K     9.50M  … 9.55M           1 ( 6%)          -  0.3% ±  0.1%
Benchmark 1 (13 runs): ./blogpost-compress-baseline 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           393ms ± 1.98ms     387ms …  396ms          1 ( 8%)        0%
  peak_rss           24.4MB ± 70.9KB    24.3MB … 24.5MB          0 ( 0%)        0%
  cpu_cycles         1.66G  ± 1.07M     1.66G  … 1.66G           0 ( 0%)        0%
  instructions       2.75G  ±  261      2.75G  … 2.75G           0 ( 0%)        0%
  cache_references    274K  ± 12.7K      267K  …  315K           1 ( 8%)        0%
  cache_misses        232K  ± 8.00K      214K  …  240K           2 (15%)        0%
  branch_misses      9.69M  ± 10.7K     9.67M  … 9.70M           0 ( 0%)        0%
Benchmark 2 (13 runs): ./target/release/examples/blogpost-compress 8 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           392ms ± 2.11ms     386ms …  394ms          1 ( 8%)          -  0.3% ±  0.4%
  peak_rss           24.5MB ± 89.1KB    24.4MB … 24.6MB          0 ( 0%)          +  0.1% ±  0.3%
  cpu_cycles         1.66G  ± 1.36M     1.65G  … 1.66G           0 ( 0%)          -  0.3% ±  0.1%
  instructions       2.72G  ±  427      2.72G  … 2.72G           2 (15%)          -  1.0% ±  0.0%
  cache_references    270K  ± 5.15K      265K  …  281K           0 ( 0%)          -  1.4% ±  2.9%
  cache_misses        237K  ± 1.81K      233K  …  240K           1 ( 8%)          +  2.1% ±  2.0%
  branch_misses      9.66M  ± 11.8K     9.64M  … 9.68M           0 ( 0%)          -  0.3% ±  0.1%
Benchmark 1 (12 runs): ./blogpost-compress-baseline 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           420ms ± 1.90ms     416ms …  425ms          2 (17%)        0%
  peak_rss           24.4MB ± 68.2KB    24.2MB … 24.5MB          4 (33%)        0%
  cpu_cycles         1.77G  ± 2.77M     1.77G  … 1.78G           0 ( 0%)        0%
  instructions       3.38G  ±  319      3.38G  … 3.38G           0 ( 0%)        0%
  cache_references    274K  ± 10.4K      265K  …  299K           0 ( 0%)        0%
  cache_misses        228K  ± 9.65K      216K  …  241K           0 ( 0%)        0%
  branch_misses      15.8M  ± 27.0K     15.8M  … 15.8M           0 ( 0%)        0%
Benchmark 2 (12 runs): ./target/release/examples/blogpost-compress 9 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           421ms ± 1.03ms     420ms …  424ms          1 ( 8%)          +  0.2% ±  0.3%
  peak_rss           24.4MB ± 94.8KB    24.2MB … 24.6MB          0 ( 0%)          +  0.2% ±  0.3%
  cpu_cycles         1.78G  ± 4.21M     1.77G  … 1.79G           1 ( 8%)          +  0.2% ±  0.2%
  instructions       3.41G  ±  190      3.41G  … 3.41G           0 ( 0%)          +  0.9% ±  0.0%
  cache_references    282K  ± 28.6K      265K  …  366K           1 ( 8%)          +  2.8% ±  6.6%
  cache_misses        235K  ± 3.25K      231K  …  240K           0 ( 0%)          +  3.1% ±  2.7%
  branch_misses      15.5M  ± 24.4K     15.5M  … 15.6M           3 (25%)        ⚡-  1.7% ±  0.1%

Comment on lines -40 to +42
/* Find the longest match, discarding those <= prev_length.
* At this point we have always match length < WANT_MIN_MATCH
*/
// Find the longest match for the string starting at offset state.strstart.
Collaborator

did we lose the property that

At this point we have always match length < WANT_MIN_MATCH

I'd assume not, in which case I'd like to keep that bit of the comment.

Author

In the new version, match_length doesn't even exist until after that point.

Collaborator

ah, the diff made that hard to see. Now the check for the length is right below the definition, so that will do.

Comment on lines -12 to -14
let mut bflush; /* set if current block must be flushed */
let mut dist;
let mut match_len = 0;
Collaborator

Historically, I think these variables were pre-declared as an optimization. But I do buy that it's not needed any more, and in fact a more restricted scope might help with re-using registers/stack slots.
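
A toy illustration of the scoping point, not the deflate_fast code itself: `find_match` and `tokenize` are hypothetical stand-ins. Because `dist` and `match_len` are bound inside the loop body, nothing has to be kept alive from one iteration to the next.

```rust
/// Hypothetical stand-in for a match finder: returns (distance, length) of a
/// run that repeats the previous byte, or (0, 0) when there is none.
fn find_match(data: &[u8], pos: usize) -> (usize, usize) {
    if pos == 0 {
        return (0, 0);
    }
    let prev = data[pos - 1];
    let len = data[pos..].iter().take_while(|&&b| b == prev).count();
    if len >= 3 {
        (1, len)
    } else {
        (0, 0)
    }
}

/// Counts (literals, matches). `dist` and `match_len` are declared inside the
/// loop body, so no value is carried from one iteration to the next.
fn tokenize(data: &[u8]) -> (usize, usize) {
    let (mut literals, mut matches) = (0, 0);
    let mut pos = 0;
    while pos < data.len() {
        let (dist, match_len) = find_match(data, pos);
        if match_len >= 3 {
            debug_assert!(dist > 0);
            matches += 1;
            pos += match_len;
        } else {
            literals += 1;
            pos += 1;
        }
    }
    (literals, matches)
}

fn main() {
    println!("{:?}", tokenize(b"abbbbbc"));
}
```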

Collaborator

@folkertdev folkertdev left a comment

excellent work, these are my local numbers

Benchmark 2 (41 runs): target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           125ms ± 1.51ms     123ms …  132ms          1 ( 2%)        ⚡-  2.9% ±  0.6%
  peak_rss           24.9MB ± 73.2KB    24.8MB … 25.0MB          0 ( 0%)          -  0.0% ±  0.2%
  cpu_cycles          500M  ± 5.04M      493M  …  523M           2 ( 5%)        ⚡-  3.6% ±  0.6%
  instructions       1.08G  ±  379      1.08G  … 1.08G           0 ( 0%)          -  0.6% ±  0.0%
  cache_references   33.7M  ±  339K     33.3M  … 34.8M           1 ( 2%)          +  0.2% ±  0.5%
  cache_misses       1.14M  ±  181K      794K  … 1.79M           1 ( 2%)          +  8.9% ± 10.5%
  branch_misses      6.26M  ± 3.11K     6.26M  … 6.27M           2 ( 5%)          -  0.1% ±  0.0%

with all of the other levels having no significant changes.

@folkertdev folkertdev merged commit 7fafed0 into trifectatechfoundation:main May 29, 2025
24 checks passed
@brian-pane brian-pane deleted the reuse-read branch May 29, 2025 17:58
3 participants