Slightly optimise the loads in the 8x4x2 indirect BGEMM kernel #521
What do these changes do?
The 8x4x2 indirect bgemm kernel uses `ld1r` "load and replicate into all lanes" instructions to load 64 bits of activation data and replicate them across the Neon register. According to the software optimisation guides of both the Cortex-A72 and the Cortex-A76, these `ld1r` instructions have a latency of 8 cycles and use both the L (load) and the F0/F1 (floating point/ASIMD) Neon pipelines. Conversely, the `ld1` variant that loads a 64-bit value into a single lane of a register has a latency of 5 cycles and uses only the L pipeline. This raises the possibility of replacing each 64-bit `ld1r` with two `ld1` instructions: one for the lower half and one for the upper half of the register. The advantage is that the loads no longer occupy the F0/F1 pipelines, which keeps them free for maximum throughput of the ALU instructions (`eor`, `cnt`).

This PR replaces the single `ld1r` instructions with pairs of single-lane `ld1` instructions for the register loads in the 8x4x2 kernel; a sketch of the change follows the note below.

Note that the same trick would be possible in the 8x4x1 kernels, but it would not be beneficial there, because single-lane loads of 32-bit elements utilise the F0/F1 pipelines just like the `ld1r` loads do.
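As a minimal sketch of the idea (register numbers, addressing, and comments are illustrative, not taken from the kernel source), the change amounts to:

```asm
// Before: one load-and-replicate, occupying the L and F0/F1 pipelines.
ld1r {v0.2d}, [x1]      // load 64 bits and replicate into both halves of v0

// After: two single-lane loads of the same 64 bits, using only the L pipeline.
ld1  {v0.d}[0], [x1]    // lower half of v0
ld1  {v0.d}[1], [x1]    // upper half of v0
```

The trade-off is one extra load instruction per replicate, in exchange for leaving the F0/F1 pipelines entirely to the `eor`/`cnt` ALU work.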
How Has This Been Tested?
CI.
Benchmark Results
The performance benefit here is slight but persistent. I benchmarked the QuickNet model family on my Raspberry Pi 4B board running Ubuntu with `num_runs=250`, and report the average latency in ms and the standard deviation below.

Related issue number
N/A.