Simplify the looping structure of bitmap scanning (#2952)
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64 bits) is unset, we now swallow all the trailing zeros, update the bit position and the current word, and return the bit position.
This amounts to unrolling the inner loop once, because we know that we'll encounter another bit in the current word.
Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner `loop` altogether, which is a win especially for sparse bitmaps.
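To illustrate the shape of the loop described above, here is a minimal sketch. It is not the actual RTS code: the `BitmapIter` type, its fields, and the `set_bits` helper are hypothetical names chosen for this example, and the bitmap is assumed to be a plain slice of 64-bit words.

```rust
/// Hypothetical iterator over the positions of set bits in a bitmap
/// stored as a slice of 64-bit words (illustrative only).
struct BitmapIter<'a> {
    words: &'a [u64],
    word_idx: usize, // index of the word currently held in `current`
    current: u64,    // unconsumed bits of that word, shifted down to bit 0
    consumed: u32,   // number of low bits of the word already shifted out
}

impl<'a> BitmapIter<'a> {
    fn new(words: &'a [u64]) -> Self {
        let current = words.first().copied().unwrap_or(0);
        BitmapIter { words, word_idx: 0, current, consumed: 0 }
    }
}

impl<'a> Iterator for BitmapIter<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        // The only remaining loop iterates over words, not over bits.
        loop {
            if self.current != 0 {
                let base = self.word_idx as u64 * 64 + self.consumed as u64;
                if self.current & 1 != 0 {
                    // Lowest bit is set: consume it and return its position.
                    self.current >>= 1;
                    self.consumed += 1;
                    return Some(base);
                } else {
                    // Lowest bit is unset but the word is non-zero, so a set
                    // bit follows: swallow the trailing zeros and return.
                    let zeros = self.current.trailing_zeros(); // 1..=63 here
                    self.current >>= zeros; // dynamic count
                    self.current >>= 1; // static count
                    self.consumed += zeros + 1;
                    return Some(base + zeros as u64);
                }
            }
            // Current word exhausted: move on to the next one, if any.
            self.word_idx += 1;
            if self.word_idx >= self.words.len() {
                return None;
            }
            self.current = self.words[self.word_idx];
            self.consumed = 0;
        }
    }
}

/// Convenience wrapper collecting all set-bit positions.
fn set_bits(words: &[u64]) -> Vec<u64> {
    BitmapIter::new(words).collect()
}
```

In this sketch both branches return, so each call does at most one `trailing_zeros` per reported bit, and wholly empty words are skipped with a single comparison each.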
## Benchmarks
This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now that the `cancan` profile creation benchmark is available, the GC-relevant cycle-count improvement is
``` shell
[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153
```
about 2.6% compared to the baseline 0.6.16 release. More benchmark data and a graph are added to #2952.
## Implementation concerns
We have to use two shifts (one with a dynamic count and one with a static count) because adding one to the dynamic count could result in a shift by 64 bits, which is undefined behaviour, and Rust traps on that.
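The corner case is a word whose only remaining set bit is the topmost one. A small sketch, using a hypothetical helper `skip_past_bit` and the standard `u64::checked_shr` to make the out-of-range count visible:

```rust
/// Hypothetical helper: drops `zeros` trailing zero bits plus the set bit
/// that follows them, using two separate shifts.
fn skip_past_bit(word: u64, zeros: u32) -> u64 {
    // `zeros` is at most 63 for a non-zero word, so each shift is in range.
    (word >> zeros) >> 1
}

/// The single-shift alternative, written with `checked_shr` so that an
/// out-of-range count shows up as `None` instead of trapping.
fn skip_past_bit_single(word: u64, zeros: u32) -> Option<u64> {
    word.checked_shr(zeros + 1)
}
```

For `word = 1 << 63` the trailing-zero count is 63, so the combined count would be 64: `skip_past_bit_single` yields `None`, while the two-shift version yields the expected `0`.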
We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse generated code by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!
N.B.: For the branchless optimisation the benchmarks look promising but I prefer to merge this first.
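For comparison, the branchless idea could look roughly like the following sketch. The helper name `take_lowest_bit` is hypothetical, and whether this compiles to better code is exactly the open question raised above.

```rust
/// Hypothetical helper: for a non-zero `word`, returns the offset of its
/// lowest set bit together with the word after that bit is shifted out.
fn take_lowest_bit(word: u64) -> (u32, u64) {
    debug_assert!(word != 0);
    // `trailing_zeros` is 0 when the lowest bit is already set, so the
    // explicit test of the lowest bit is no longer needed.
    let zeros = word.trailing_zeros(); // 0..=63 for a non-zero word
    (zeros, (word >> zeros) >> 1)
}
```

The two shifts are still required here, since `zeros + 1` can reach 64.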