Conversation

@valadaptive (Contributor) commented Aug 24, 2025

Rustybuzz has gone pretty much dormant, and HarfRust is its unofficial successor. It lives under the HarfBuzz organization, and instead of ttf-parser, it uses read-fonts from Fontations (which Swash also uses, albeit a different version).

A few tweaks to the code here let us get rid of ttf-parser as well and use Skrifa (a higher-level library built atop read-fonts) instead.

Once the version of Skrifa used in Swash is bumped, this should consolidate dependencies even further.

The last consumer of ttf-parser is fontdb, which uses an ancient version of it and apparently "isn't maintained anymore".

I can't see any difference in the Hebrew snapshot tests, unless VS Code's LFS diff previews are broken or something. It looks like these tests just always fail locally.

@valadaptive (Contributor, Author) commented Aug 24, 2025

Shaping performance is currently ~25-40% worse with HarfRust on text_shaping_benchmarks. This mostly seems to be due to the cost of constructing a ShapePlan: parsing the script and feature lists is far more expensive in HarfRust than in rustybuzz. Here's the Samply profile.

Should we be caching the ShapePlans? (Since the user_features array contains text ranges, it doesn't seem like we can reuse a ShapePlan to shape different pieces of text.)

(/cc @dfrg)

@jackpot51 (Member) commented

I agree with switching to harfrust. I will need to run lots of tests first, and I'll be busy for the next couple of weeks.

@nicoburns (Contributor) commented

> Should we be caching the ShapePlans? (Since the user_features array contains text ranges, it doesn't seem like we can reuse a ShapePlan to shape different pieces of text.)

I'm not exactly an expert, but every time this comes up, caching ShapePlans is mentioned as "a very important optimisation", so I think you should definitely be doing that.

See also:

  • Some API changes harfbuzz/harfrust#57: the HarfRust PR which discusses the design behind its API
  • Cache HarfRust structs linebender/parley#406: the Parley PR adding caching to its HarfRust usage. Note that all of ShaperData, ShaperInstance, and ShapePlan are cached, and that the cache is a smallish LRU cache (currently 16 entries), the idea being that there are typically only a few unique fonts (or configurations of fonts) in use in any given run of text
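The small-LRU idea described above can be sketched like this (a std-only illustration; `LruCache` and the `(font_id, script)` key are made up for the example and are not Parley's or HarfRust's actual types). A linear scan is fine at this size:

```rust
use std::collections::VecDeque;

/// Minimal fixed-capacity LRU cache in the spirit described above.
struct LruCache<K: PartialEq, V> {
    entries: VecDeque<(K, V)>,
    capacity: usize,
}

impl<K: PartialEq, V> LruCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self { entries: VecDeque::with_capacity(capacity), capacity }
    }

    /// Return the cached value for `key`, building and inserting it on a miss.
    fn get_or_insert_with(&mut self, key: K, make: impl FnOnce() -> V) -> &V {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            // Hit: move the entry to the front (most recently used).
            let hit = self.entries.remove(pos).unwrap();
            self.entries.push_front(hit);
        } else {
            if self.entries.len() == self.capacity {
                // Evict the least recently used entry.
                self.entries.pop_back();
            }
            self.entries.push_front((key, make()));
        }
        &self.entries.front().unwrap().1
    }
}

fn main() {
    // Keyed by a stand-in for (font id, script); strings stand in for plans.
    let mut cache: LruCache<(u32, &str), String> = LruCache::new(16);
    cache.get_or_insert_with((0, "Latn"), || "plan A".to_string());
    cache.get_or_insert_with((0, "Zzzz"), || "plan B".to_string());
    // Second lookup is a hit and does not rebuild the plan.
    let plan = cache.get_or_insert_with((0, "Latn"), || unreachable!());
    println!("{plan}");
}
```

With only ~16 entries, the linear scan plus move-to-front is typically cheaper than maintaining a hash map, and it keeps the handful of hot font configurations resident.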

@valadaptive (Contributor, Author) commented

Maybe I still don't get how the shape plan API works, because the presence of the features array seems to indicate that each shape plan is meant to shape just one run of text. Sure, if you don't pass any features or each feature's span is infinite (as is the case here), then you can reuse them. But if an API consumer did genuinely want to pass in features that are only enabled for certain spans of text, then each shape plan would be tied to those features and the span indices to which they apply.

@nicoburns (Contributor) commented

> Maybe I still don't get how the shape plan API works, because the presence of the features array seems to indicate that each shape plan is meant to shape just one run of text. Sure, if you don't pass any features or each feature's span is infinite (as is the case here), then you can reuse them. But if an API consumer did genuinely want to pass in features that are only enabled for certain spans of text, then each shape plan would be tied to those features and the span indices to which they apply.

I think the idea is that it's common to have large amounts of text with exactly the same font, features, etc. It's also common to have a few styles that are switched between, for example "switch into bold (or italic) for one word/sentence and then back to regular text" or even "switch into the heading style for a run of text and then back to the body style". So if you cache a few ShapePlans then there's a good chance that you'll be able to reuse one of the existing ones for the next text you come to shape.

If you make that cache persistent, then you can also likely reuse that cache across frames where e.g. you change the text content but keep the same styles.

@dfrg commented Aug 27, 2025

> Maybe I still don't get how the shape plan API works, because the presence of the features array seems to indicate that each shape plan is meant to shape just one run of text. Sure, if you don't pass any features or each feature's span is infinite (as is the case here), then you can reuse them. But if an API consumer did genuinely want to pass in features that are only enabled for certain spans of text, then each shape plan would be tied to those features and the span indices to which they apply.

The short answer is that you can reuse a plan as long as the feature sets are the same with regard to tag, value, and whether or not the feature is global. That is, the actual indices of the range-limited features don't matter when constructing a shape plan. This is the same behavior as HarfBuzz.

edit: the reason is that non-global features require mask bits to be allocated and those allocations are fixed in the plan.
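A toy illustration of that rule (the `Feature` and `PlanFeature` types here are invented for the example; they are not harfrust's actual API): two range-limited features with the same tag and value produce the same plan key even though their ranges differ.

```rust
/// Hypothetical stand-in for a shaping feature as passed by the caller.
#[derive(Clone, Copy)]
struct Feature {
    tag: [u8; 4],
    value: u32,
    start: u32,
    end: u32,
}

/// The part of a feature that a shape plan actually depends on.
#[derive(PartialEq, Eq, Debug)]
struct PlanFeature {
    tag: [u8; 4],
    value: u32,
    is_global: bool,
}

fn plan_feature(f: &Feature) -> PlanFeature {
    PlanFeature {
        tag: f.tag,
        value: f.value,
        // Only *whether* the feature covers the whole buffer matters to the
        // plan; the concrete indices of a limited range do not.
        is_global: f.start == 0 && f.end == u32::MAX,
    }
}

fn main() {
    let a = Feature { tag: *b"smcp", value: 1, start: 3, end: 9 };
    let b = Feature { tag: *b"smcp", value: 1, start: 40, end: 55 };
    // Different ranges, same plan key: a cached plan built for `a`
    // can shape text that uses `b`.
    assert_eq!(plan_feature(&a), plan_feature(&b));
}
```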

@valadaptive (Contributor, Author) commented

> The short answer is that you can reuse a plan as long as the feature sets are the same with regard to tag, value, and whether or not the feature is global. That is, the actual indices of the range-limited features don't matter when constructing a shape plan. This is the same behavior as HarfBuzz.
>
> edit: the reason is that non-global features require mask bits to be allocated and those allocations are fixed in the plan.

This would be a great thing to put in the documentation!

@jackpot51 (Member) commented

HarfRust recently had a new release; I'd recommend updating this PR to use it, as it includes performance improvements.

@jackpot51 (Member) commented

Thanks! This next week I will be evaluating this.

@jackpot51 jackpot51 self-assigned this Sep 7, 2025
@jackpot51 jackpot51 moved this to Beta in COSMIC Epoch 1 Sep 7, 2025
@valadaptive (Contributor, Author) commented

I've updated HarfRust, which provides a ~3% perf boost. Much more significantly, I've added a shape_plan_cache to go along with the shape_run_cache. This provides massive speedups (3-4x) compared to current main:

Current
ShapeLine/ASCII Fast Path
                        time:   [1.5013 ms 1.5034 ms 1.5064 ms]
                        change: [+244.72% +245.83% +247.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

ShapeLine/BiDi Processing
                        time:   [2.9482 ms 2.9506 ms 2.9531 ms]
                        change: [+187.90% +188.80% +189.48%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

ShapeLine/Layout Heavy  time:   [3.2056 ms 3.2113 ms 3.2192 ms]
                        change: [+283.52% +285.92% +288.05%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

ShapeLine/Combined Stress
                        time:   [17.280 ms 17.303 ms 17.328 ms]
                        change: [+240.02% +240.59% +241.23%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

BidiParagraphs/ASCII    time:   [2.8834 µs 2.8856 µs 2.8879 µs]
                        change: [-0.2879% -0.0953% +0.1065%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

BidiParagraphs/Mixed    time:   [33.798 µs 33.828 µs 33.861 µs]
                        change: [-3.6776% -3.5828% -3.4934%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  16 (16.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
This PR w/ `shape_plan_cache`

ShapeLine/ASCII Fast Path
                        time:   [440.47 µs 441.37 µs 442.42 µs]
                        change: [-70.930% -70.832% -70.748%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking ShapeLine/BiDi Processing: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.1s, enable flat sampling, or reduce sample count to 60.
ShapeLine/BiDi Processing
                        time:   [1.0094 ms 1.0103 ms 1.0111 ms]
                        change: [-65.820% -65.778% -65.736%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

ShapeLine/Layout Heavy  time:   [828.27 µs 829.95 µs 831.82 µs]
                        change: [-74.280% -74.207% -74.145%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

ShapeLine/Combined Stress
                        time:   [5.0084 ms 5.0170 ms 5.0280 ms]
                        change: [-71.070% -71.006% -70.933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

BidiParagraphs/ASCII    time:   [2.8644 µs 2.8662 µs 2.8684 µs]
                        change: [-0.6308% -0.4049% -0.1998%] (p = 0.00 < 0.05)
                        Change within noise threshold.

BidiParagraphs/Mixed    time:   [35.050 µs 35.106 µs 35.186 µs]
                        change: [+3.2408% +3.3620% +3.4938%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Not sure how I feel about adding another cache that the user has to clear on their own, or whether it's "cheating" the benchmark to use it.

It looks like the most granular shaping level that's publicly exposed to the API consumer is line-level (via BufferLine::layout), so we could move the shape plan cache to ShapeBuffer and clear it after every shaped line. This results in worse performance, though: 700µs/iter for "ShapeLine/ASCII Fast Path" (vs 450µs/iter for the current shape_plan_cache), and 1.7ms/iter for "ShapeLine/BiDi Processing" (vs 1ms/iter). If the privacy of APIs were more carefully considered, and we didn't need to guarantee that laying out single lines over and over won't blow up the cache and cause unbounded memory consumption, we could clear the cache at layout-level granularity, which should amortize the cost much better.

In general, caching seems to be a bit of a mess currently. There's shape_run_cache, which the user needs to clear manually, but also font_matches_cache, which has a fixed capacity of 256 (which could turn into a performance cliff). Maybe it's better to just use LRU or LFU caches everywhere and let the user configure their capacities? Although for UI purposes, an "increment cache generation and prune" operation does make sense, since we can always run it after each frame...
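The "increment cache generation and prune" idea could look roughly like this (an illustrative, std-only sketch, not cosmic-text's actual API): each entry records the generation in which it was last used, and a per-frame trim drops everything that wasn't touched.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Generation-based cache: entries unused for a full frame are pruned.
struct GenerationCache<K, V> {
    generation: u64,
    entries: HashMap<K, (u64, V)>,
}

impl<K: Hash + Eq, V> GenerationCache<K, V> {
    fn new() -> Self {
        Self { generation: 0, entries: HashMap::new() }
    }

    fn get_or_insert_with(&mut self, key: K, make: impl FnOnce() -> V) -> &V {
        let generation = self.generation;
        let entry = self
            .entries
            .entry(key)
            .or_insert_with(|| (generation, make()));
        entry.0 = generation; // mark as used in the current generation
        &entry.1
    }

    /// Run once per frame: drop entries not used since the last trim,
    /// then start a new generation.
    fn trim(&mut self) {
        let current = self.generation;
        self.entries.retain(|_, (used, _)| *used == current);
        self.generation += 1;
    }
}

fn main() {
    let mut cache: GenerationCache<&str, u32> = GenerationCache::new();
    cache.get_or_insert_with("body", || 1);
    cache.get_or_insert_with("heading", || 2);
    cache.trim(); // both survive: used this generation
    cache.get_or_insert_with("body", || 1);
    cache.trim(); // "heading" was not used this frame and is dropped
    assert_eq!(cache.entries.len(), 1);
}
```

This bounds memory to what the last frame actually used while keeping hot entries warm across frames, at the cost of requiring the user to call `trim` at a sensible point.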

@jackpot51 (Member) commented

Shape run cache is optional. We had a shape plan cache, but it was removed for performance and memory usage reasons. So long as it won't grow forever and generally increases performance (try the UHDR sample linked in the README), it should be OK.

@jackpot51 (Member) commented

I agree with improving caching control generally.

@valadaptive (Contributor, Author) commented

I put a simple "least recently added" VecDeque cache into the ShapeBuffer struct. It stores the 6 most recently added shape plans and uses harfrust::ShapePlanKey<'_> to check whether any of them match the current shaper options. harfrust::ShapePlanKey<'_> does not implement Hash and has a lifetime parameter, so it's intended to be ephemeral and cannot key a HashMap, but it works well for checking equality against the (at most 6) shape plans in our cache. This approach avoids making the user clear the cache themselves, and seems to be slightly faster than hashing.

At first, I tried storing just the single most recently used shape plan, but we shape whitespace separately, and its script will always be Zzzz (unknown). Shaping e.g. Latin text would therefore ping-pong between a shape plan with the Latn script (for a word) -> a shape plan with the Zzzz script (for the space) -> Latn again, and so on. I chose 6 plans because that's what seems to be necessary for the "BiDi Processing" benchmark.

I added a benchmark which shapes sample/hello.txt, and that is slower with the VecDeque than with the HashMap-based approach: it seems to attempt shaping with hundreds of different fonts, easily overflowing the cache. IMO, the root cause of the slowdown is that we're shaping the same piece of text hundreds of times during font fallback.
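The cache described above can be sketched as follows (a simplified, std-only sketch; `PlanKey` and `Plan` are placeholders for the real harfrust types, and the linear equality scan mirrors the no-hashing approach):

```rust
use std::collections::VecDeque;

#[derive(PartialEq)]
struct PlanKey {
    script: [u8; 4], // placeholder for the full set of shaper options
}

struct Plan; // placeholder for a compiled shape plan

const CAPACITY: usize = 6;

/// "Least recently added": entries keep insertion order and are never
/// reordered on a hit; the oldest entry is dropped when the cache is full.
struct ShapePlanCache {
    plans: VecDeque<(PlanKey, Plan)>,
}

impl ShapePlanCache {
    fn new() -> Self {
        Self { plans: VecDeque::with_capacity(CAPACITY) }
    }

    fn get_or_build(&mut self, key: PlanKey, build: impl FnOnce() -> Plan) -> &Plan {
        let idx = match self.plans.iter().position(|(k, _)| *k == key) {
            Some(pos) => pos, // hit: reuse the plan without reordering
            None => {
                if self.plans.len() == CAPACITY {
                    self.plans.pop_front(); // drop the oldest plan
                }
                self.plans.push_back((key, build()));
                self.plans.len() - 1
            }
        };
        &self.plans[idx].1
    }
}

fn main() {
    let mut cache = ShapePlanCache::new();
    // Latin word -> whitespace (Zzzz) -> Latin again: the second Latn
    // lookup is a hit, avoiding the ping-pong a 1-entry cache would have.
    cache.get_or_build(PlanKey { script: *b"Latn" }, || Plan);
    cache.get_or_build(PlanKey { script: *b"Zzzz" }, || Plan);
    cache.get_or_build(PlanKey { script: *b"Latn" }, || unreachable!());
}
```

Skipping the move-to-front step avoids shuffling the deque on every hit, which is a reasonable trade-off when the whole cache fits in a handful of entries.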

@WatchMkr WatchMkr added this to the beta milestone Sep 8, 2025
@jackpot51 (Member) commented

Are there further optimizations you want to try, or is this ready for merge?

@valadaptive (Contributor, Author) commented

This is ready to merge now. If I can think of any more optimizations, I'll leave them for the future.

@jackpot51 jackpot51 merged commit 2610c86 into pop-os:main Sep 9, 2025
1 of 2 checks passed
@WatchMkr WatchMkr moved this from Beta to Complete in COSMIC Epoch 1 Sep 15, 2025