[ cpu_backend ] Enable practical use of qsi4cxp_qs4cxs1s0 GEMM with openMP multithreading and automatic ukernel candidate selection #3519
Conversation
- armv8.2-a+fp16+dotprod -> armv8.2-a+fp16+dotprod+i8mm - Adding i8mm enables including high-performance SIMD intrinsics **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- note : for cpu backend interface, need to add fallback function for this... **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- Assume the weight is offline-packed in qs4cxs1s0 manner, with its optimal ukernel idx - unittest TC says: [INFO] sgemm : 387812 ns 387 us 0 ms [INFO] test_gemm_qai8dxp_qsi4cxp_packed: 16667 ns 16 us 0 ms [INFO] MSE: 0.554387, COS_SIM: 0.998757, MAX_DIFFER: 3.13451, SUM: 267.005, SUM_GT: 300.489 **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
…s1s0 format - todo: automatically return the optimal kernel variant idx and feed it to packed-TC **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
**Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- Multithread with openMP w.r.t. the N direction, coarse grained (see the openMP sketch after this commit list) // Tested on Galaxy S23 - BEFORE test_gemm_qai8dxp_qsi4cxp_packed: 6934427 ns 6934 us 6 ms - AFTER test_gemm_qai8dxp_qsi4cxp_packed: 4398489 ns 4398 us 4 ms **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- The optimal kernel idx is not always consistent between runs. - From a heuristic point of view, I could get the optimal kernel idx over multiple runs and chose the most frequently occurring one (17 / 20); see the selection sketch after this commit list. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- todo: support non-4-divisible-N case [ RUN ] nntrainer_cpu_backend_standalone.qai8dxp_qsi4cxp_512x768x2048 BEFORE : 8639844 ns 8639 us 8 ms AFTER : 2524531 ns 2524 us 2 ms FYI) gemm_q4_0: 6986094 ns 6986 us 6 ms **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- nntr_gemm_qai8dxp_qsi4cxp_packed - nntr_qsi4cxp_qs4cxs1s0_rhs_pack - nntr_get_rhs_packed_size_qsi4cxp_qs4cxs1s0 **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm.h - kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
- kleidiai-based functions basically support fallback implementations, but some functions for specific purposes are only used on ARM. - Still, for easier maintenance of cpu_backend, those function headers should be declared on the other sides as well. (Needs some other opinions though) - trivial doxygen tags **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
Note that this operation expects: 1. Weight is transposed, 2. Weight is quantized in a channel-wise scheme, 3. Weight is packed with idx number 1 (GEMV) or 5 (GEMM), 4. Activation is FP32 (since it is implemented in float_tensor.cpp) **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <[email protected]>
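As referenced in the openMP commit above, the multithreading is a coarse-grained split over the N (output column) direction. The sketch below is only an illustration under assumptions: the chunk runner, the tile size, and the problem size are placeholders, not the actual nntrainer implementation.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Placeholder for "run the packed GEMM ukernel on output columns [n_start, n_end)".
static void run_chunk(std::size_t n_start, std::size_t n_end) {
  std::printf("thread %d handles columns [%zu, %zu)\n", omp_get_thread_num(),
              n_start, n_end);
}

int main() {
  const std::size_t n = 2048;   // total output columns (N direction), assumed
  const std::size_t n_step = 4; // assumed ukernel column tile; keeps chunks aligned

#pragma omp parallel
  {
    const std::size_t num_threads = omp_get_num_threads();
    const std::size_t tid = omp_get_thread_num();

    // Coarse-grained split: one contiguous, n_step-aligned block of columns
    // per thread, instead of fine-grained interleaving.
    const std::size_t tiles = (n + n_step - 1) / n_step;
    const std::size_t tiles_per_thread = (tiles + num_threads - 1) / num_threads;
    const std::size_t n_start = std::min(n, tid * tiles_per_thread * n_step);
    const std::size_t n_end = std::min(n, n_start + tiles_per_thread * n_step);

    if (n_start < n_end)
      run_chunk(n_start, n_end);
  }
  return 0;
}
```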
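And for the selection sketch referenced in the kernel-idx commit above: a rough, self-contained illustration of the "run several times, keep the most frequent winner" heuristic might look like this. The candidate workloads and function names are hypothetical stand-ins for timing the real ukernel variants.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Time every candidate once and return the index of the fastest one.
static std::size_t fastest_once(const std::vector<std::function<void()>> &candidates) {
  std::size_t best = 0;
  auto best_time = std::chrono::nanoseconds::max();
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    candidates[i]();
    const auto elapsed = std::chrono::steady_clock::now() - t0;
    if (elapsed < best_time) {
      best_time = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
      best = i;
    }
  }
  return best;
}

// Repeat the measurement and pick the index that wins most often,
// mirroring the "most frequently occurring winner over 20 runs" heuristic.
static std::size_t pick_kernel_idx(const std::vector<std::function<void()>> &candidates,
                                   std::size_t runs = 20) {
  std::vector<std::size_t> wins(candidates.size(), 0);
  for (std::size_t r = 0; r < runs; ++r)
    ++wins[fastest_once(candidates)];
  std::size_t best = 0;
  for (std::size_t i = 1; i < wins.size(); ++i)
    if (wins[i] > wins[best])
      best = i;
  return best;
}

int main() {
  // Dummy workloads standing in for the GEMM ukernel variants.
  std::vector<std::function<void()>> candidates = {
      [] { volatile long x = 0; for (long i = 0; i < 100000; ++i) x += i; },
      [] { volatile long x = 0; for (long i = 0; i < 200000; ++i) x += i; },
  };
  std::printf("chosen kernel idx: %zu\n", pick_kernel_idx(candidates));
  return 0;
}
```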
(force-pushed from 31bbb8e to 3354330)
jijoongmoon left a comment
LGTM
size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
  return kai_m_step;
}
is this function really exported to other modules?
If this value is accessed by external modules in the middle of a critical calculation (performance impacting), you can define this function as a static-inline function in the header and have kai_m_step declared in the header. Moreover, by conventional C++ principles, this is better as a static value of a public class, removing all the worries.
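For illustration, a minimal sketch of the header-only alternative suggested here could look like the following; the constant's value is an assumption, not taken from the actual kernel source.

```cpp
#include <cstddef>

// Hypothetical header-only variant of the getter discussed above: declare the
// constant in the header and expose it through a static inline getter, so an
// external caller in a hot loop pays no out-of-line call cost.
// NOTE: the value 4 is an assumed placeholder, not the actual kernel's m_step.
static constexpr std::size_t kai_m_step = 4;

static inline std::size_t
kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm(void) {
  return kai_m_step;
}
```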
What I worry: calling this function in a loop or other module.
I see your worries and let me explain.
First of all, kai_get_m_step_matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm is not actually called from anywhere at the moment.
Then why should we keep this function? -> Because it is a component of the kai_matmul_ukernel_f32_qa8dxp_qs4cxp struct, which is the general format for using kleidiai GEMM kernels; we imported only some of them and plan to introduce more in the future. In that case, we might need some m_step-calling functions, but even then, it would not be called in a loop or in any module other than neon_kleidiai.cpp.
If you are worried about the latency perspective, I think it won't affect the total GEMM latency meaningfully.
However, if you are worried about the SW-design perspective, maybe we should undergo a major refactoring to detach the whole struct below into a static value of a public class, since kleidiai obtains these const values through plain functions:
struct kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel {
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_m_step_func_t get_m_step;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_n_step_func_t get_n_step;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_mr_func_t get_mr;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_nr;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_nr_func_t get_kr;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_sr_func_t get_sr;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_lhs_packed_offset_func_t
get_lhs_packed_offset;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_rhs_packed_offset_func_t
get_rhs_packed_offset;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_offset_func_t get_dst_offset;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_get_dst_size_func_t get_dst_size;
kai_matmul_clamp_f32_qai8dxp_qsi4cxp_run_matmul_func_t run_matmul;
};
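For illustration, a rough sketch of what such a detachment into a static value of a public class might look like; the class name and the values are hypothetical placeholders, not the real ukernel parameters.

```cpp
#include <cstddef>

// Hypothetical refactoring sketch: instead of a struct of getter function
// pointers, expose the per-ukernel constants as static members of a public
// class. All names and values here are placeholders, not the real parameters.
struct Qai8dxpQsi4cxpUkernelTraits {
  static constexpr std::size_t m_step = 4;
  static constexpr std::size_t n_step = 4;
  static constexpr std::size_t mr = 4;
  static constexpr std::size_t nr = 4;
  static constexpr std::size_t kr = 8;
  static constexpr std::size_t sr = 2;
};

// The values become compile-time constants at every call site.
static_assert(Qai8dxpQsi4cxpUkernelTraits::m_step > 0,
              "tile step must be positive");
```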
    size_t m, size_t n, size_t k, void *lhs_native_mtx_f32,
    void *rhs_native_mtx_qs4cx, void *rhs_scales_f32, float *dst_mtx_f32,
    bool transB, float lower_bound, float upper_bound) {
  __fallback_nntr_gemm_qai8dxp_qsi4cxp_unpacked(
Missing return?
Fixed! Sorry for trivial misses!
#ifdef ENABLE_FP16
  if (input.q_scheme() == QScheme::PER_CHANNEL_AFFINE) {
    uint32_t opt_kernel_idx = (M == 1) ? 1 : 5;
    nntr_gemm_qai8dxp_qsi4cxp_packed(
Missing return statement?
The throw at LINE 983 is always called after this!
Aha, I missed this one! It should be
#else
throw std::runtime_error(
  "Error: FP16 should be enabled for QINT4 Dot on CPU");
#endif
Not
#endif
throw std::runtime_error(
"Error: FP16 should be enabled for QINT4 Dot on CPU");| * @param rhs_scales_f32 matrix quant scale after quantization to stroe | ||
| * @param transB | ||
| */ | ||
| void nntr_quant_qs4cx_f32(size_t n, size_t k, void *rhs_native_mtx_f32, |
double definition? (not built in our github action scenarios?)
Thanks for your thorough review! I wonder why this problem wasn't detected previously..!
Dependency of the PR
None.
Summary
This Pull Request introduces an even more optimized version of the qsi4cxp_qs4cxs1s0 GEMM first proposed in #3497.
Optimization techniques:
- coarse-grained openMP multithreading along the N direction
- automatic ukernel candidate selection
As a result, this patch achieves approximately 4x faster GEMM computation latency:
In my inspection, this is more than 2~3 times faster than the previous Q4_0.
FYI)
gemm_q4_0: 6986094 ns 6986 us 6 ms
For a practical usage sample, refer to unittest_nntrainer_cpu_backend_fp16.cpp