AdamHillier
Contributor

What do these changes do?

The 8x4x2 indirect bgemm kernel uses ld1r ("load and replicate into all lanes") instructions to load 64 bits of activation data and replicate them across the Neon register. According to the software optimisation guides for both the Cortex-A72 and the Cortex-A76, these ld1r instructions have a latency of 8 cycles and use both the L (load) and the F0/F1 (floating point/ASIMD) Neon pipelines. In contrast, the ld1 variant that loads a 64-bit value into a single lane of a register has a latency of 5 cycles and uses only the L pipeline. This raises the possibility of replacing each 64-bit ld1r with two ld1 instructions: one for the lower half and one for the upper half of the register. The advantage is that the loads no longer occupy the F0/F1 pipelines, which leaves those pipelines free for the ALU instructions (eor, cnt) and so maximises their throughput.

This PR replaces each ld1r instruction with a pair of single-element ld1 instructions for the register loads in the 8x4x2 kernel.
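To make the equivalence concrete, here is a minimal sketch of the two load sequences. It is an illustration, not the kernel code from this PR: `Reg128`, `load_replicate`, `load_lane`, and `load_replicate_via_lanes` are hypothetical names that model a 128-bit Neon register in portable C++, and the assembly in the comments uses example register names. The AArch64-only section uses the standard `arm_neon.h` intrinsics, but a compiler is free to pick different instructions for them, so it shows the semantics only.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical portable model of a 128-bit Neon register as two 64-bit lanes.
struct Reg128 {
  std::uint64_t lane[2];
};

// Models `ld1r {v0.2d}, [x0]`: load one 64-bit value and replicate it
// into both lanes of the register.
inline Reg128 load_replicate(const std::uint64_t* src) {
  std::uint64_t value;
  std::memcpy(&value, src, sizeof(value));
  return Reg128{{value, value}};
}

// Models `ld1 {v0.d}[lane], [x0]`: load one 64-bit value into a single
// lane and leave the other lane unchanged.
inline void load_lane(Reg128& reg, int lane, const std::uint64_t* src) {
  std::memcpy(&reg.lane[lane], src, sizeof(std::uint64_t));
}

// The replacement sequence: two single-lane loads from the same address,
// one for the lower half and one for the upper half of the register.
inline Reg128 load_replicate_via_lanes(const std::uint64_t* src) {
  Reg128 reg{{0, 0}};
  load_lane(reg, 0, src);  // ld1 {v0.d}[0], [x0]
  load_lane(reg, 1, src);  // ld1 {v0.d}[1], [x0]
  return reg;
}

#if defined(__aarch64__)
#include <arm_neon.h>

// The same two operations expressed with Neon intrinsics (semantics only;
// the instructions actually emitted are up to the compiler).
inline uint64x2_t neon_load_replicate(const std::uint64_t* src) {
  return vld1q_dup_u64(src);
}

inline uint64x2_t neon_load_replicate_via_lanes(const std::uint64_t* src) {
  uint64x2_t reg = vdupq_n_u64(0);
  reg = vld1q_lane_u64(src, reg, 0);
  reg = vld1q_lane_u64(src, reg, 1);
  return reg;
}
#endif
```

Both sequences leave the register in the same state, so the swap is purely a scheduling change: the result is identical, and only the pipelines occupied by the loads differ.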

Note that the same trick would be possible in the 8x4x1 kernels, but it would not be beneficial, because single-lane loads of 32-bit elements use the F0/F1 pipelines just as the ld1r loads do.

How Has This Been Tested?

CI.

Benchmark Results

The performance benefit here is slight but persistent. I benchmarked the QuickNet model family on my Raspberry Pi 4B board running Ubuntu with num_runs=250, and report the average latency in ms and the standard deviation below.

| Model | Baseline indirect BGEMM (ms) | PR indirect BGEMM (ms) | % change |
| --- | --- | --- | --- |
| QuickNet | 28.93 ± 0.06 | 28.77 ± 0.05 | -0.6% |
| QuickNetLarge | 43.67 ± 0.07 | 43.29 ± 0.06 | -0.9% |
| QuickNetXL | 75.07 ± 0.07 | 74.66 ± 0.07 | -0.5% |
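The % change column can be reproduced from the two latency columns. The sketch below assumes the figure is the relative change against the baseline, rounded to one decimal place; `percent_change` and `round_to_tenth` are hypothetical helper names, not part of the benchmark tooling.

```cpp
#include <cmath>

// Relative change of the PR latency against the baseline latency, in percent.
// Negative values mean the PR is faster.
inline double percent_change(double baseline_ms, double pr_ms) {
  return (pr_ms - baseline_ms) / baseline_ms * 100.0;
}

// Round to one decimal place, matching the precision reported in the table.
inline double round_to_tenth(double value) {
  return std::round(value * 10.0) / 10.0;
}
```

For example, `round_to_tenth(percent_change(28.93, 28.77))` reproduces the QuickNet row. In each row the absolute difference (0.16 ms, 0.38 ms, and 0.41 ms) is larger than the reported standard deviations, which is consistent with the improvement being slight but persistent.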

Related issue number

N/A.

@AdamHillier AdamHillier added the internal-improvement Internal Improvements and Maintenance label Oct 1, 2020
@AdamHillier AdamHillier requested a review from a team October 1, 2020 18:00
@AdamHillier AdamHillier added performance Improvements related to inference speed and removed internal-improvement Internal Improvements and Maintenance labels Oct 1, 2020
Member

@lgeiger lgeiger left a comment

Great find 👍

Collaborator

@Tombana Tombana left a comment

Nice. How do you feel about adding a comment to the file with something along the lines of

// The separated ld1 instructions are faster than the ld1r instruction,
// see https://github.com/larq/compute-engine/pull/521

@AdamHillier
Contributor Author

> Nice. How do you feel about adding a comment to the file with something along the lines of
>
> // The separated ld1 instructions are faster than the ld1r instruction,
> // see https://github.com/larq/compute-engine/pull/521

Ah yeah, will do 👍

@AdamHillier AdamHillier force-pushed the indirect-bgemm-8x4x2-loads branch from 760a0f1 to bbec302 Compare October 2, 2020 10:07
@AdamHillier AdamHillier merged commit 811f145 into master Oct 2, 2020
@AdamHillier AdamHillier deleted the indirect-bgemm-8x4x2-loads branch October 2, 2020 10:34