-
Notifications
You must be signed in to change notification settings - Fork 92
Optimize OpenCL Addition #3408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimize OpenCL Addition #3408
Conversation
* @param kernel OpenCL kernel | ||
* @param global_work_size Total number of work items that will execute the | ||
* kernel function | ||
* @param local_work_size Number of work items that make up a work group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description of the work_dim parameter is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the comments and implement the recommended changes.
return false; | ||
} | ||
|
||
clFinish(command_queue_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use clWaitForEvents
instead.
blas_cc->command_queue_inst_.enqueueSVMMap(const_cast<float *>(input), | ||
dim1_size, false); | ||
blas_cc->command_queue_inst_.enqueueSVMMap(res, dim2_size, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. SVMMap is used to enable the host side to use the allocated SVM.
blas_cc->command_queue_inst_.enqueueSVMMap(const_cast<float *>(input), | |
dim1_size, false); | |
blas_cc->command_queue_inst_.enqueueSVMMap(res, dim2_size, false); | |
blas_cc->command_queue_inst_.enqueueSVMUnmap(input); | |
blas_cc->command_queue_inst_.enqueueSVMUnmap(res); |
if (use_svm) { | ||
blas_cc->command_queue_inst_.enqueueSVMUnmap(const_cast<float *>(input)); | ||
blas_cc->command_queue_inst_.enqueueSVMUnmap(res); | ||
} else { | ||
auto &clbuffInstance = ClBufferManager::Global(); | ||
result = clbuffInstance.getOutBufferA()->ReadDataRegion( | ||
blas_cc->command_queue_inst_, dim2_size, res); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you plan to map or read data from a buffer after execution, there is no need to use DispatchCommandAndWait
. This means that clFinish
or clWaitForEvents
is unnecessary.
f44547f
to
4d83229
Compare
e8b01eb
to
c8f25c1
Compare
|
||
} while (false); | ||
std::array<size_t, 3> global_work_size = {dim, 1, 1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/home/m.wlasiuk/Desktop/nntrainer-mw/nntrainer/layers/cl_layers/swiglu_cl.cpp:161:45: error: narrowing conversion of ‘dim’ from ‘int’ to ‘long unsigned int’ [-Werror=narrowing]
161 | std::array<size_t, 3> global_work_size = {dim, 1, 1};
nullptr, 0, nullptr, &swiglu_wait)) { | ||
} | ||
|
||
cl_context->command_queue_inst_.waitForEvent(1, &swiglu_wait); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release events after use
[==========] Running 19 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 19 tests from blas_kernels
[ RUN ] blas_kernels.dotCL_sgemv_M_1_1
[ OK ] blas_kernels.dotCL_sgemv_M_1_1 (874 ms)
[ RUN ] blas_kernels.dotCL_sgemv_M_1_2
[ OK ] blas_kernels.dotCL_sgemv_M_1_2 (42 ms)
[ RUN ] blas_kernels.dotCL_sgemv_N_1_1
[ OK ] blas_kernels.dotCL_sgemv_N_1_1 (38 ms)
[ RUN ] blas_kernels.dotCL_sgemv_N_1_2
[ OK ] blas_kernels.dotCL_sgemv_N_1_2 (39 ms)
[ RUN ] blas_kernels.dotCL_sgemv_n
[ OK ] blas_kernels.dotCL_sgemv_n (34 ms)
[ RUN ] blas_kernels.dotCL_sgemv_N_1_M_1_1
[ OK ] blas_kernels.dotCL_sgemv_N_1_M_1_1 (4 ms)
[ RUN ] blas_kernels.dotCL_sgemv_N_1_M_1_2
[ OK ] blas_kernels.dotCL_sgemv_N_1_M_1_2 (0 ms)
[ RUN ] blas_kernels.dot_gemm_50_768_1024_noTrans
[ OK ] blas_kernels.dot_gemm_50_768_1024_noTrans (44 ms)
[ RUN ] blas_kernels.dot_gemm_50_768_2048_transB
[ OK ] blas_kernels.dot_gemm_50_768_2048_transB (88 ms)
[ RUN ] blas_kernels.dot_gemm_50_768_1024_transA
[ OK ] blas_kernels.dot_gemm_50_768_1024_transA (25 ms)
[ RUN ] blas_kernels.dot_gemm_50_768_2048_transAB
[ OK ] blas_kernels.dot_gemm_50_768_2048_transAB (40 ms)
[ RUN ] blas_kernels.addition_i
FP32 ADD : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 4502 us
- time : GPU = 64 us
[ OK ] blas_kernels.addition_i (452 ms)
[ RUN ] blas_kernels.addition_i_svm
FP32 ADD : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 5221 us
- time : GPU = 5103 us
[ OK ] blas_kernels.addition_i_svm (383 ms)
[ RUN ] blas_kernels.l2norm
[ OK ] blas_kernels.l2norm (13 ms)
[ RUN ] blas_kernels.absolute_sum
[ OK ] blas_kernels.absolute_sum (0 ms)
[ RUN ] blas_kernels.rmsnorm_fp32
FP32 RMSNORM : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 2747898 us
- time : GPU = 873 us
[ OK ] blas_kernels.rmsnorm_fp32 (2957 ms)
[ RUN ] blas_kernels.rmsnorm_fp32_svm
FP32 RMSNORM : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 2702493 us
- time : GPU = 156278 us
[ OK ] blas_kernels.rmsnorm_fp32_svm (3022 ms)
[ RUN ] blas_kernels.swiglu_layer_fp32
FP32 SWIGLU : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 296745 us
- time : GPU = 147 us
[ OK ] blas_kernels.swiglu_layer_fp32 (570 ms)
[ RUN ] blas_kernels.swiglu_layer_fp32_svm
FP32 SWIGLU : N: 1 C: 1 H: 3072 W: 3072
- time : CPU = 419297 us
- time : GPU = 35503 us
[ OK ] blas_kernels.swiglu_layer_fp32_svm (672 ms)
[----------] 19 tests from blas_kernels (9314 ms total)
[----------] Global test environment tear-down
[==========] 19 tests from 1 test suite ran. (9314 ms total)
[ PASSED ] 19 tests. |
set_result &= kernel->SetKernelArguments(0, &bufferInA, sizeof(cl_mem)); | ||
set_result &= kernel->SetKernelArguments(1, &bufferInB, sizeof(cl_mem)); | ||
set_result &= kernel->SetKernelArguments(2, &bufferOutA, sizeof(cl_mem)); | ||
if (!set_result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing ml_loge message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should change SetKernelArguments to throw in case of error instead of return bool in some next PR. This way we will avoid need of checking ret value all the time
|
||
if (!cl_context->command_queue_inst_.enqueueKernel( | ||
kernel->GetKernel(), global_work_size.size(), global_work_size.data(), | ||
nullptr, 0, nullptr, &swiglu_wait)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return + message
if (!use_svm) { | ||
auto &clbuffInstance = ClBufferManager::Global(); | ||
if (!clbuffInstance.getOutBufferA()->ReadDataRegion( | ||
cl_context->command_queue_inst_, dim * sizeof(float), vecYdata)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read data region already log error messages
if (!result) { | ||
return; | ||
} | ||
if (use_svm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from here on - message if failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should change SetKernelSVMArguments to throw in case of error instead of return bool in some next PR. This way we will avoid need of checking ret value all the time
if (!result) { | ||
return; | ||
} | ||
|
||
result = clbuffInstance.getOutBufferA()->ReadDataRegion( | ||
blas_cc->command_queue_inst_, dim2_size, res); | ||
blas_cc->command_queue_inst_.waitForEvent(1, &addition_wait); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release event
56e8460
to
d110098
Compare
typedef cl_int(CL_API_CALL *PFN_clWaitForEvents)(cl_uint num_events, | ||
const cl_event *event_list); | ||
|
||
typedef cl_int(CL_API_CALL *PFN_clReleaseEvent)(cl_event); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous declarations have parameter names commented out. These 2 - one has type + name and second has only type for parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added only clReleaseEvent function and I would suggest to stick with types only in the future
Unify opencl addition for FP16/32 Add possibility to pass nullptr as local work size Add SVM option **Self-evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Grzegorz Kisala <[email protected]>
Add flag to indicate that memory pool data was created using SVM **Self-evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Grzegorz Kisala <[email protected]>
Modify swiglu and rmsnorm cl implementations to work with both svm allocated tensors and buffers **Self-evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Grzegorz Kisala <[email protected]>
Optimize OpenCL Addition
Unify opencl addition for FP16/32
Add possibility to pass nullptr as local work size
Speedup:
Before:
FP32 ADD : N: 12 C: 1 H: 2048 W: 2048
After:
FP32 ADD : N: 12 C: 1 H: 2048 W: 2048