Slightly optimise the loads in the 8x4x2 indirect BGEMM kernel #521
What do these changes do?
The 8x4x2 indirect bgemm kernel uses `ld1r` "load and replicate into all lanes" instructions to load 64 bits of activation data and replicate them across the Neon register. According to the software optimisation guides of both the Cortex-A72 and the Cortex-A76, these `ld1r` instructions have a latency of 8 cycles and use both the L (load) and the F0/F1 (floating point/ASIMD) Neon pipelines. Conversely, the `ld1` variant that loads a 64-bit value into a single lane of a register has a latency of 5 cycles and uses only the L pipeline. This raises the possibility of replacing each 64-bit `ld1r` with two `ld1` instructions: one for the lower half and one for the upper half of the register. The advantage is that the loads no longer occupy the F0/F1 pipelines, which keeps them free for maximum throughput of the ALU instructions (`eor`, `cnt`).

This PR replaces the single `ld1r` instructions with pairs of single-lane `ld1` instructions for the register loads in the 8x4x2 kernel; a sketch of the change follows the note below.

Note that the same trick would be possible in the 8x4x1 kernels, but it would not be beneficial there, because single-lane loads of 32-bit elements utilise the F0/F1 pipelines just like the `ld1r` loads do.
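As a minimal sketch of the idea (register numbers, addressing, and comments are illustrative, not taken from the kernel source), the change amounts to:

```asm
// Before: one load-and-replicate, occupying the L and F0/F1 pipelines.
ld1r {v0.2d}, [x1]      // load 64 bits and replicate into both halves of v0

// After: two single-lane loads of the same 64 bits, using only the L pipeline.
ld1  {v0.d}[0], [x1]    // lower half of v0
ld1  {v0.d}[1], [x1]    // upper half of v0
```

The trade-off is one extra load instruction per replicate, in exchange for leaving the F0/F1 pipelines entirely to the `eor`/`cnt` ALU work.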
How Has This Been Tested?
CI.
Benchmark Results
The performance benefit here is slight but persistent. I benchmarked the QuickNet model family on my Raspberry Pi 4B board running Ubuntu with `num_runs=250`, and report the average latency in ms and the standard deviation below.

Related issue number
N/A.