Conversation

@skykongkong8
Member

Dependency of the PR

None.

Summary

This Pull Request introduces an even more optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques used:

  • neon SIMD
  • openMP based multithreading
  • automatic ukernel selection

As a result, this patch reduces GEMM computation latency by approximately 4x:

// Tested on Galaxy S25U
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

In my measurements, this is also 2~3 times faster than the previous Q4_0 implementation.
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

@github-actions

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Oct 25, 2025
@github-actions
Copy link

This PR was closed because it has been stalled for 3 days with no activity.

@github-actions github-actions bot closed this Oct 28, 2025
@skykongkong8 skykongkong8 reopened this Nov 3, 2025
- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm
- Adding i8mm enables the use of high-performance SIMD intrinsics (see the sketch below)
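As a minimal illustration of what +i8mm unlocks (an assumption for illustration, not code from this PR): the 8-bit integer matrix-multiply intrinsics such as vmmlaq_s32 only become available when the target enables the i8mm extension, which the compiler advertises via __ARM_FEATURE_MATMUL_INT8.

#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

// SMMLA via vmmlaq_s32: multiply a 2x8 int8 tile by an 8x2 int8 tile and
// accumulate the 2x2 int32 result into acc; only compiles with +i8mm.
int32x4_t i8mm_tile_macc(int32x4_t acc, int8x16_t a, int8x16_t b) {
  return vmmlaq_s32(acc, a, b);
}
#endif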

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- note: for the cpu backend interface, a fallback function needs to be added for this...

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Assumes the weight is offline-packed in the qs4cxs1s0 manner, with its optimal ukernel idx
- unittest TC reports:
[INFO] sgemm :    387812 ns 387 us 0 ms
[INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms
[INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…s1s0 format

- todo: automatically return the optimal kernel variant idx and feed it to the packed TC

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Multithread with OpenMP along the N direction, coarse-grained (a sketch follows the numbers below)
// Tested on Galaxy S23
- BEFORE
test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms
- AFTER
test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms
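
A minimal sketch of the coarse-grained scheme, assuming a hypothetical per-block wrapper run_ukernel_block (the real code instead offsets rhs_packed/dst via the ukernel's get_rhs_packed_offset / get_dst_offset getters):

#include <cstddef>
#include <omp.h>

// hypothetical per-block kernel wrapper (declaration only, for this sketch)
void run_ukernel_block(std::size_t M, std::size_t N, std::size_t K,
                       const void *lhs_packed, const void *rhs_packed,
                       float *dst, std::size_t n_offset);

void gemm_threaded_over_n(std::size_t M, std::size_t N, std::size_t K,
                          const void *lhs_packed, const void *rhs_packed,
                          float *dst) {
#pragma omp parallel
  {
    const std::size_t nth = (std::size_t)omp_get_num_threads();
    const std::size_t tid = (std::size_t)omp_get_thread_num();
    const std::size_t chunk = (N + nth - 1) / nth; // columns per thread
    const std::size_t n0 = tid * chunk;
    const std::size_t n1 = (n0 + chunk < N) ? (n0 + chunk) : N;
    if (n0 < n1) // coarse-grained: one ukernel call per thread's column block
      run_ukernel_block(M, n1 - n0, K, lhs_packed, rhs_packed, dst, n0);
  }
}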

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- The optimal kernel idx is not always consistent across runs.
- As a heuristic, I benchmarked multiple runs and chose the most frequently occurring optimal kernel idx (17 / 20); see the sketch below.
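
A sketch of that heuristic, under the assumption that each candidate ukernel can be wrapped as a callable running one GEMM (names here are illustrative, not the PR's actual code):

#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

size_t pick_ukernel_idx(const std::vector<std::function<void()>> &variants,
                        int runs = 20) {
  std::map<size_t, int> wins; // ukernel idx -> number of runs it won
  for (int r = 0; r < runs; ++r) {
    size_t best = 0;
    double best_ns = 1e300;
    for (size_t i = 0; i < variants.size(); ++i) {
      auto t0 = std::chrono::steady_clock::now();
      variants[i](); // one timed GEMM call with variant i
      auto t1 = std::chrono::steady_clock::now();
      const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
      if (ns < best_ns) { best_ns = ns; best = i; }
    }
    ++wins[best]; // single-run timings are noisy on mobile SoCs
  }
  size_t mode = 0;
  int count = -1;
  for (const auto &[idx, c] : wins) // pick the most frequent winner
    if (c > count) { count = c; mode = idx; }
  return mode;
}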

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- todo: support the case where N is not divisible by 4
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- nntr_gemm_qai8dxp_qsi4cxp_packed
- nntr_qsi4cxp_qs4cxs1s0_rhs_pack
- nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0
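
A hedged sketch of how the three entry points listed above could fit together; the parameter lists are illustrative assumptions, since the actual signatures live in the cpu_backend headers:

#include <cstddef>
#include <cstdint>
#include <vector>

void example_offline_pack_and_gemm(size_t M, size_t N, size_t K,
                                   uint32_t ukernel_idx, const float *lhs_f32,
                                   void *rhs_native_qs4cx,
                                   void *rhs_scales_f32, float *dst_f32) {
  // 1. query how many bytes the packed RHS needs for this ukernel variant
  size_t packed_bytes =
    nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0(N, K, ukernel_idx);
  std::vector<uint8_t> rhs_packed(packed_bytes);

  // 2. offline: pack the channel-wise int4 weight once
  nntr_qsi4cxp_qs4cxs1s0_rhs_pack(N, K, rhs_native_qs4cx, rhs_scales_f32,
                                  rhs_packed.data(), ukernel_idx);

  // 3. runtime: run the packed GEMM against the FP32 activation
  nntr_gemm_qai8dxp_qsi4cxp_packed(M, N, K, (void *)lhs_f32,
                                   rhs_packed.data(), dst_f32, ukernel_idx);
}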

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- KleidiAI-based functions generally come with fallback implementations, but some special-purpose functions are only used on Arm.
- Still, for easier maintenance of cpu_backend, those function headers should be declared on the other sides as well. (Other opinions are welcome, though.)
- trivial doxygen tags

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Note that this operation expects:
1. Weight is transposed,
2. Weight is quantized channel-wise,
3. Weight is packed with ukernel idx 1 (GEMV) or 5 (GEMM),
4. Activation is FP32 (since it is implemented in float_tensor.cpp)
A sketch of these expectations as caller-side checks follows.
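
The same expectations, sketched as caller-side checks; the names mirror this PR's snippets but are illustrative, not the actual float_tensor.cpp code:

// illustrative precondition checks before dispatching to the packed kernel
if (!transB)
  throw std::invalid_argument("weight must be transposed (transB == true)");
if (weight.q_scheme() != QScheme::PER_CHANNEL_AFFINE)
  throw std::invalid_argument("weight must be quantized channel-wise");
uint32_t opt_kernel_idx = (M == 1) ? 1 /* GEMV */ : 5 /* GEMM */;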

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the poc/arm/kai/i8mmkernel+weightofflinepackingdetached branch from 31bbb8e to 3354330 Compare November 3, 2025 00:23
@github-actions github-actions bot removed the Stale label Nov 3, 2025
Collaborator

@jijoongmoon jijoongmoon left a comment


LGTM

@jijoongmoon jijoongmoon merged commit 5e19f42 into nnstreamer:main Nov 4, 2025
25 of 26 checks passed
size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
return kai_m_step;
}
Member

@myungjoo myungjoo Nov 4, 2025


is this function really exported to other modules?

Member


If this value is accessed by external modules in the middle of a critical (performance-impacting) calculation, you can define this function as a static-inline function in the header and have kai_m_step declared in the header. Moreover, by conventional C++ principles, this is better off as a static value of a public class, removing all the worries. A sketch of the static-inline alternative is shown below.
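
A sketch of that suggestion (an assumption of what it could look like, not the merged code):

#include <stddef.h>

// in the header: expose the constant plus a static-inline accessor, so
// external callers pay no cross-module call cost (illustrative value below)
static const size_t kai_m_step = 4;

static inline size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
  return kai_m_step;
}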

Member


What I worry about: this function being called in a loop or from another module.

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


I see your concern; let me explain.

First of all, kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm is not actually called from anywhere at present.
Then why keep this function? Because it is a component of the kai_matmul_ukernel_f32_qa8dxp_qs4cxp struct, which is the general format for using KleidiAI GEMM kernels; we imported only some of them and plan to introduce more in the future. In that case we might need m_step-calling functions, but even then they would not be called in a loop, nor from any module other than neon_kleidiai.cpp.

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


If you are worried from the latency perspective, I don't think it will meaningfully affect the total GEMM latency...
However, if you are worried from the SW-design perspective, maybe we should undergo a major refactoring that detaches the whole

struct kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel {
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_m_step_func_t get_m_step;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_n_step_func_t get_n_step;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_mr_func_t get_mr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_nr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_kr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_sr_func_t get_sr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_lhs_packed_offset_func_t
    get_lhs_packed_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_rhs_packed_offset_func_t
    get_rhs_packed_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_offset_func_t get_dst_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_size_func_t get_dst_size;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_run_matmul_func_t run_matmul;
};

struct into static values of a public class, since KleidiAI exposes these const values through plain functions. For context, a hedged sketch of how such a function-pointer table is typically consumed is shown below.
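
The run_matmul argument list below is an assumption for illustration, not a verified KleidiAI signature:

// pick a variant and drive it entirely through the table, so callers never
// hard-code a kernel's mangled symbol name
const kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel &uk = ukernels[opt_idx];
const size_t mr = uk.get_mr(); // packing geometry queried from the kernel
const size_t nr = uk.get_nr();
uk.run_matmul(M, N, K, lhs_packed, rhs_packed, dst,
              dst_stride_row, dst_stride_col, lower_bound, upper_bound);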

size_t m, size_t n, size_t k, void *lhs_native_mtx_f32,
void *rhs_native_mtx_qs4cx, void *rhs_scales_f32, float *dst_mtx_f32,
bool transB, float lower_bound, float upper_bound) {
__fallback_nntr_gemm_qai8dxp_qsi4cxp_unpacked(
Member


Missing return?

Member Author


Fixed! Sorry for the trivial miss!

#ifdef ENABLE_FP16
if (input.q_scheme() == QScheme::PER_CHANNEL_AFFINE) {
uint32_t opt_kernel_idx = (M == 1) ? 1 : 5;
nntr_gemm_qai8dxp_qsi4cxp_packed(
Member

@myungjoo myungjoo Nov 4, 2025


Missing return statement?

The throw at LINE 983 is always reached after this!

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


Aha, I missed this one! It should be

#else
  throw std::runtime_error(
    "Error: FP16 should be enabled for QINT4 Dot on CPU");
#endif

Not

#endif
  throw std::runtime_error(
    "Error: FP16 should be enabled for QINT4 Dot on CPU");

* @param rhs_scales_f32 matrix quant scale to store after quantization
* @param transB
*/
void nntr_quant_qs4cx_f32(size_t n, size_t k, void *rhs_native_mtx_f32,
Member


double definition? (not built in our github action scenarios?)

Member Author


Thanks for your thorough review! I wonder why this problem wasn't detected previously..!

@skykongkong8 skykongkong8 deleted the poc/arm/kai/i8mmkernel+weightofflinepackingdetached branch November 4, 2025 08:57