Conversation

@skykongkong8
Member

Dependency of the PR

None.

Summary

This Pull Request introduces an even more optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques used:

  • neon SIMD
  • openMP based multithreading
  • automatic ukernel selection

As a result, this patch reduces GEMM computation latency by approximately 4x:

// Tested on Galaxy S25U
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

In my measurements, this is also 2~3 times faster than the previous Q4_0 implementation.
FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

@github-actions

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@github-actions github-actions bot added the Stale label Oct 25, 2025
@github-actions
Copy link

This PR was closed because it has been stalled for 3 days with no activity.

@github-actions github-actions bot closed this Oct 28, 2025
@skykongkong8 skykongkong8 reopened this Nov 3, 2025
- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm
- Adding i8mm enables the use of high-performance SIMD intrinsics (see the sketch below)
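As a minimal illustration of what +i8mm unlocks (an assumption for illustration, not code from this PR): the 8-bit integer matrix-multiply intrinsics such as vmmlaq_s32 only become available when the target enables the i8mm extension, which the compiler advertises via __ARM_FEATURE_MATMUL_INT8.

#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

// SMMLA via vmmlaq_s32: multiply a 2x8 int8 tile by an 8x2 int8 tile and
// accumulate the 2x2 int32 result into acc; only compiles with +i8mm.
int32x4_t i8mm_tile_macc(int32x4_t acc, int8x16_t a, int8x16_t b) {
  return vmmlaq_s32(acc, a, b);
}
#endif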

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- note: for the cpu backend interface, a fallback function needs to be added for this...

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Assumes the weight is offline-packed in the qs4cxs1s0 manner, with its optimal ukernel idx
- unittest TC reports:
[INFO] sgemm :    387812 ns 387 us 0 ms
[INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms
[INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…s1s0 format

- todo: automatically return the optimal kernel variant idx and feed it to the packed TC

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Multithread with OpenMP along the N direction, coarse-grained (a sketch follows the numbers below)
// Tested on Galaxy S23
- BEFORE
test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms
- AFTER
test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms
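
A minimal sketch of the coarse-grained scheme, assuming a hypothetical per-block wrapper run_ukernel_block (the real code instead offsets rhs_packed/dst via the ukernel's get_rhs_packed_offset / get_dst_offset getters):

#include <cstddef>
#include <omp.h>

// hypothetical per-block kernel wrapper (declaration only, for this sketch)
void run_ukernel_block(std::size_t M, std::size_t N, std::size_t K,
                       const void *lhs_packed, const void *rhs_packed,
                       float *dst, std::size_t n_offset);

void gemm_threaded_over_n(std::size_t M, std::size_t N, std::size_t K,
                          const void *lhs_packed, const void *rhs_packed,
                          float *dst) {
#pragma omp parallel
  {
    const std::size_t nth = (std::size_t)omp_get_num_threads();
    const std::size_t tid = (std::size_t)omp_get_thread_num();
    const std::size_t chunk = (N + nth - 1) / nth; // columns per thread
    const std::size_t n0 = tid * chunk;
    const std::size_t n1 = (n0 + chunk < N) ? (n0 + chunk) : N;
    if (n0 < n1) // coarse-grained: one ukernel call per thread's column block
      run_ukernel_block(M, n1 - n0, K, lhs_packed, rhs_packed, dst, n0);
  }
}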

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- The optimal kernel idx is not always consistent across runs.
- As a heuristic, I benchmarked multiple runs and chose the most frequently occurring optimal kernel idx (17 / 20); see the sketch below.
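
A sketch of that heuristic, under the assumption that each candidate ukernel can be wrapped as a callable running one GEMM (names here are illustrative, not the PR's actual code):

#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

size_t pick_ukernel_idx(const std::vector<std::function<void()>> &variants,
                        int runs = 20) {
  std::map<size_t, int> wins; // ukernel idx -> number of runs it won
  for (int r = 0; r < runs; ++r) {
    size_t best = 0;
    double best_ns = 1e300;
    for (size_t i = 0; i < variants.size(); ++i) {
      auto t0 = std::chrono::steady_clock::now();
      variants[i](); // one timed GEMM call with variant i
      auto t1 = std::chrono::steady_clock::now();
      const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
      if (ns < best_ns) { best_ns = ns; best = i; }
    }
    ++wins[best]; // single-run timings are noisy on mobile SoCs
  }
  size_t mode = 0;
  int count = -1;
  for (const auto &[idx, c] : wins) // pick the most frequent winner
    if (c > count) { count = c; mode = idx; }
  return mode;
}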

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- todo: support the case where N is not divisible by 4
[ RUN      ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048
BEFORE : 8639844 ns 8639 us 8 ms
AFTER : 2524531 ns 2524 us 2 ms

FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- nntr_gemm_qai8dxp_qsi4cxp_packed
- nntr_qsi4cxp_qs4cxs1s0_rhs_pack
- nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0
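
A hedged sketch of how the three entry points listed above could fit together; the parameter lists are illustrative assumptions, since the actual signatures live in the cpu_backend headers:

#include <cstddef>
#include <cstdint>
#include <vector>

void example_offline_pack_and_gemm(size_t M, size_t N, size_t K,
                                   uint32_t ukernel_idx, const float *lhs_f32,
                                   void *rhs_native_qs4cx,
                                   void *rhs_scales_f32, float *dst_f32) {
  // 1. query how many bytes the packed RHS needs for this ukernel variant
  size_t packed_bytes =
    nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0(N, K, ukernel_idx);
  std::vector<uint8_t> rhs_packed(packed_bytes);

  // 2. offline: pack the channel-wise int4 weight once
  nntr_qsi4cxp_qs4cxs1s0_rhs_pack(N, K, rhs_native_qs4cx, rhs_scales_f32,
                                  rhs_packed.data(), ukernel_idx);

  // 3. runtime: run the packed GEMM against the FP32 activation
  nntr_gemm_qai8dxp_qsi4cxp_packed(M, N, K, (void *)lhs_f32,
                                   rhs_packed.data(), dst_f32, ukernel_idx);
}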

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- KleidiAI-based functions generally come with fallback implementations, but some special-purpose functions are only used on Arm.
- Still, for easier maintenance of cpu_backend, those function headers should be declared on the other sides as well. (Other opinions are welcome, though.)
- trivial doxygen tags

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
Note that this operation expects:
1. Weight is transposed,
2. Weight is quantized channel-wise,
3. Weight is packed with ukernel idx 1 (GEMV) or 5 (GEMM),
4. Activation is FP32 (since it is implemented in float_tensor.cpp)
A sketch of these expectations as caller-side checks follows.
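
The same expectations, sketched as caller-side checks; the names mirror this PR's snippets but are illustrative, not the actual float_tensor.cpp code:

// illustrative precondition checks before dispatching to the packed kernel
if (!transB)
  throw std::invalid_argument("weight must be transposed (transB == true)");
if (weight.q_scheme() != QScheme::PER_CHANNEL_AFFINE)
  throw std::invalid_argument("weight must be quantized channel-wise");
uint32_t opt_kernel_idx = (M == 1) ? 1 /* GEMV */ : 5 /* GEMM */;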

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@skykongkong8 skykongkong8 force-pushed the poc/arm/kai/i8mmkernel+weightofflinepackingdetached branch from 31bbb8e to 3354330 Compare November 3, 2025 00:23
@github-actions github-actions bot removed the Stale label Nov 3, 2025
Collaborator

@jijoongmoon jijoongmoon left a comment


LGTM

@jijoongmoon jijoongmoon merged commit 5e19f42 into nnstreamer:main Nov 4, 2025
25 of 26 checks passed
size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
return kai_m_step;
}
Member

@myungjoo myungjoo Nov 4, 2025


is this function really exported to other modules?

Member


If this value is accessed by external modules in the middle of a critical (performance-impacting) calculation, you can define this function as a static-inline function in the header and have kai_m_step declared in the header. Moreover, by conventional C++ principles, this is better off as a static value of a public class, removing all the worries. A sketch of the static-inline alternative is shown below.
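
A sketch of that suggestion (an assumption of what it could look like, not the merged code):

#include <stddef.h>

// in the header: expose the constant plus a static-inline accessor, so
// external callers pay no cross-module call cost (illustrative value below)
static const size_t kai_m_step = 4;

static inline size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
  return kai_m_step;
}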

Member


What I worry about: this function being called in a loop or from another module.

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


I see your concern; let me explain.

First of all, kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm is not actually called from anywhere at present.
Then why keep this function? Because it is a component of the kai_matmul_ukernel_f32_qa8dxp_qs4cxp struct, which is the general format for using KleidiAI GEMM kernels; we imported only some of them and plan to introduce more in the future. In that case we might need m_step-calling functions, but even then they would not be called in a loop, nor from any module other than neon_kleidiai.cpp.

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


If you are worried from the latency perspective, I don't think it will meaningfully affect the total GEMM latency...
However, if you are worried from the SW-design perspective, maybe we should undergo a major refactoring that detaches the whole

struct kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel {
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_m_step_func_t get_m_step;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_n_step_func_t get_n_step;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_mr_func_t get_mr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_nr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_kr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_sr_func_t get_sr;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_lhs_packed_offset_func_t
    get_lhs_packed_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_rhs_packed_offset_func_t
    get_rhs_packed_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_offset_func_t get_dst_offset;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_size_func_t get_dst_size;
  kai_matmul_clamp_f32_qai8dxp_qsi4cxp_run_matmul_func_t run_matmul;
};

struct into static values of a public class, since KleidiAI exposes these const values through plain functions. For context, a hedged sketch of how such a function-pointer table is typically consumed is shown below.
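
The run_matmul argument list below is an assumption for illustration, not a verified KleidiAI signature:

// pick a variant and drive it entirely through the table, so callers never
// hard-code a kernel's mangled symbol name
const kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel &uk = ukernels[opt_idx];
const size_t mr = uk.get_mr(); // packing geometry queried from the kernel
const size_t nr = uk.get_nr();
uk.run_matmul(M, N, K, lhs_packed, rhs_packed, dst,
              dst_stride_row, dst_stride_col, lower_bound, upper_bound);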

size_t m, size_t n, size_t k, void *lhs_native_mtx_f32,
void *rhs_native_mtx_qs4cx, void *rhs_scales_f32, float *dst_mtx_f32,
bool transB, float lower_bound, float upper_bound) {
__fallback_nntr_gemm_qai8dxp_qsi4cxp_unpacked(
Member


Missing return?

Member Author


Fixed! Sorry for the trivial miss!

#ifdef ENABLE_FP16
if (input.q_scheme() == QScheme::PER_CHANNEL_AFFINE) {
uint32_t opt_kernel_idx = (M == 1) ? 1 : 5;
nntr_gemm_qai8dxp_qsi4cxp_packed(
Member

@myungjoo myungjoo Nov 4, 2025


Missing return statement?

The throw at LINE 983 is always reached after this!

Member Author

@skykongkong8 skykongkong8 Nov 4, 2025


Aha, I missed this one! It should be

#else
  throw std::runtime_error(
    "Error: FP16 should be enabled for QINT4 Dot on CPU");
#endif

Not

#endif
  throw std::runtime_error(
    "Error: FP16 should be enabled for QINT4 Dot on CPU");

* @param rhs_scales_f32 matrix quant scale to store after quantization
* @param transB
*/
void nntr_quant_qs4cx_f32(size_t n, size_t k, void *rhs_native_mtx_f32,
Member


double definition? (not built in our github action scenarios?)

Member Author


Thanks for your thorough review! I wonder why this problem wasn't detected previously..!

@skykongkong8 skykongkong8 deleted the poc/arm/kai/i8mmkernel+weightofflinepackingdetached branch November 4, 2025 08:57