Simplify the looping structure of bitmap scanning (#2952)

ggreif · web-flow · commit 9bd0c7dbd785 · 2021-12-09T12:47:01.000Z
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64 bits) is unset, we now swallow all the trailing zeros, update the bit position and the current word, and return the position of the bit we found. This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word. Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner `while` loop altogether, which is a win especially for sparse bitmaps.

## Benchmarks

This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile creation benchmark is available, and the GC-relevant cycle-count improvement is

``` shell
[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153
```

about 2.6% compared to the baseline 0.6.16 release. More benchmark data and a graph have been added to #2952.

## Implementation concerns

We have to use two shifts (one with a dynamic count and one with a static count) because adding one to the dynamic count could result in a shift by 64 bits, which is an overflowing shift for a 64-bit word, and Rust traps on that. We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse generated code by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!

N.B.: For the branchless optimisation the benchmarks look promising but I prefer to merge this first.
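To illustrate the two-shift concern, here is a minimal sketch on a single 64-bit word. `WordScanner` and `next_set_bit` are hypothetical names for illustration only; the real `BitmapIter` also advances across words and returns `BITMAP_ITER_END`, which this sketch leaves out.

```rust
/// Hypothetical single-word scanner (an illustration, not the RTS `BitmapIter`).
struct WordScanner {
    current_word: u64,
    current_bit_idx: u32,
}

impl WordScanner {
    fn new(word: u64) -> Self {
        WordScanner { current_word: word, current_bit_idx: 0 }
    }

    /// Returns the index of the next set bit, or `None` once the word is exhausted.
    fn next_set_bit(&mut self) -> Option<u32> {
        if self.current_word == 0 {
            return None;
        }
        if self.current_word & 0b1 != 0 {
            // Lowest bit is set: report it and step past it.
            let bit_idx = self.current_bit_idx;
            self.current_word >>= 1;
            self.current_bit_idx += 1;
            Some(bit_idx)
        } else {
            // The word is non-zero, so `trailing_zeros()` is at most 63 here.
            let shift_amt = self.current_word.trailing_zeros();
            // Two shifts: a dynamic one (<= 63) to reach the set bit, then a
            // static one to drop it. A single shift by `shift_amt + 1` could
            // be a shift by 64 when only bit 63 is set.
            self.current_word >>= shift_amt;
            self.current_word >>= 1;
            let bit_idx = self.current_bit_idx + shift_amt;
            self.current_bit_idx = bit_idx + 1;
            Some(bit_idx)
        }
    }
}
```

Scanning a word in which only bit 63 is set exercises the edge case: `shift_amt` is 63, and the two separate shifts leave `current_word` at zero without ever shifting by 64.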
diff --git a/rts/motoko-rts-tests/src/gc/heap.rs b/rts/motoko-rts-tests/src/gc/heap.rs
@@ -206,7 +206,7 @@ impl MotokoHeapInner {
 
         // The Worst-case unalignment w.r.t. 32-byte alignment is 28 (assuming
         // that we have general word alignment). So we over-allocate 28 bytes.
-        let mut heap: Vec<u8> = vec![0; heap_size + 28];
+        let mut heap = vec![0u8; heap_size + 28];
 
         // MarkCompact assumes that the dynamic heap starts at a 32-byte multiple
         let realign = match gc {
diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,8 +167,8 @@ impl BitmapIter {
 
         // Outer loop iterates 64-bit words
         loop {
-            // Inner loop iterates bits in the current word
-            while self.current_word != 0 {
+            // Inner conditional examines the least significant bit(s) in the current word
+            if self.current_word != 0 {
                 if self.current_word & 0b1 != 0 {
                     let bit_idx = self.current_bit_idx;
                     self.current_word >>= 1;
@@ -177,12 +177,21 @@ impl BitmapIter {
                 } else {
                     let shift_amt = self.current_word.trailing_zeros();
                     self.current_word >>= shift_amt;
-                    self.current_bit_idx += shift_amt;
+                    self.current_word >>= 1;
+                    let bit_idx = self.current_bit_idx + shift_amt;
+                    self.current_bit_idx = bit_idx + 1;
+                    return bit_idx;
                 }
             }
 
             // Move on to next word (always 64-bit boundary)
             self.current_bit_idx += self.leading_zeros;
+            unsafe {
+                debug_assert_eq!(
+                    (self.current_bit_idx - get_bitmap_forbidden_size() as u32 * 8) % 64,
+                    0
+                )
+            }
             if self.current_bit_idx == self.size {
                 return BITMAP_ITER_END;
             }