Skip to content

Conversation

gkisalapl
Copy link
Contributor

@gkisalapl gkisalapl commented Aug 12, 2025

Optimize OpenCL Addition

Unify opencl addition for FP16/32
Add possibility to pass nullptr as local work size

Speedup:

Before:
FP32 ADD : N: 12 C: 1 H: 2048 W: 2048

  • time : CPU = 14135 us
  • time : GPU = 114108 us

After:
FP32 ADD : N: 12 C: 1 H: 2048 W: 2048

  • time : CPU = 13866 us
  • time : GPU = 74545 us

Comment on lines 197 to 201
* @param kernel OpenCL kernel
* @param global_work_size Total number of work items that will execute the
* kernel function
* @param local_work_size Number of work items that make up a work group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of the work_dim parameter is missing.

Copy link
Contributor

@djeong20 djeong20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review the comments and implement the recommended changes.

return false;
}

clFinish(command_queue_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use clWaitForEvents instead.

Comment on lines 272 to 274
blas_cc->command_queue_inst_.enqueueSVMMap(const_cast<float *>(input),
dim1_size, false);
blas_cc->command_queue_inst_.enqueueSVMMap(res, dim2_size, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. SVMMap is used to enable the host side to use the allocated SVM.

Suggested change
blas_cc->command_queue_inst_.enqueueSVMMap(const_cast<float *>(input),
dim1_size, false);
blas_cc->command_queue_inst_.enqueueSVMMap(res, dim2_size, false);
blas_cc->command_queue_inst_.enqueueSVMUnmap(input);
blas_cc->command_queue_inst_.enqueueSVMUnmap(res);

Comment on lines 332 to 336
if (use_svm) {
blas_cc->command_queue_inst_.enqueueSVMUnmap(const_cast<float *>(input));
blas_cc->command_queue_inst_.enqueueSVMUnmap(res);
} else {
auto &clbuffInstance = ClBufferManager::Global();
result = clbuffInstance.getOutBufferA()->ReadDataRegion(
blas_cc->command_queue_inst_, dim2_size, res);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you plan to map or read data from a buffer after execution, there is no need to use DispatchCommandAndWait. This means that clFinish or clWaitForEvents is unnecessary.

@gkisalapl gkisalapl force-pushed the add_cl branch 2 times, most recently from e8b01eb to c8f25c1 Compare August 22, 2025 07:09

} while (false);
std::array<size_t, 3> global_work_size = {dim, 1, 1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/home/m.wlasiuk/Desktop/nntrainer-mw/nntrainer/layers/cl_layers/swiglu_cl.cpp:161:45: error: narrowing conversion of ‘dim’ from ‘int’ to ‘long unsigned int’ [-Werror=narrowing]
161 | std::array<size_t, 3> global_work_size = {dim, 1, 1};

nullptr, 0, nullptr, &swiglu_wait)) {
}

cl_context->command_queue_inst_.waitForEvent(1, &swiglu_wait);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release events after use

@mwlasiuk
Copy link
Contributor

[==========] Running 19 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 19 tests from blas_kernels
[ RUN      ] blas_kernels.dotCL_sgemv_M_1_1
[       OK ] blas_kernels.dotCL_sgemv_M_1_1 (874 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_M_1_2
[       OK ] blas_kernels.dotCL_sgemv_M_1_2 (42 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_N_1_1
[       OK ] blas_kernels.dotCL_sgemv_N_1_1 (38 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_N_1_2
[       OK ] blas_kernels.dotCL_sgemv_N_1_2 (39 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_n
[       OK ] blas_kernels.dotCL_sgemv_n (34 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_N_1_M_1_1
[       OK ] blas_kernels.dotCL_sgemv_N_1_M_1_1 (4 ms)
[ RUN      ] blas_kernels.dotCL_sgemv_N_1_M_1_2
[       OK ] blas_kernels.dotCL_sgemv_N_1_M_1_2 (0 ms)
[ RUN      ] blas_kernels.dot_gemm_50_768_1024_noTrans
[       OK ] blas_kernels.dot_gemm_50_768_1024_noTrans (44 ms)
[ RUN      ] blas_kernels.dot_gemm_50_768_2048_transB
[       OK ] blas_kernels.dot_gemm_50_768_2048_transB (88 ms)
[ RUN      ] blas_kernels.dot_gemm_50_768_1024_transA
[       OK ] blas_kernels.dot_gemm_50_768_1024_transA (25 ms)
[ RUN      ] blas_kernels.dot_gemm_50_768_2048_transAB
[       OK ] blas_kernels.dot_gemm_50_768_2048_transAB (40 ms)
[ RUN      ] blas_kernels.addition_i
FP32 ADD : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 4502 us
 - time : GPU = 64 us
[       OK ] blas_kernels.addition_i (452 ms)
[ RUN      ] blas_kernels.addition_i_svm
FP32 ADD : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 5221 us
 - time : GPU = 5103 us
[       OK ] blas_kernels.addition_i_svm (383 ms)
[ RUN      ] blas_kernels.l2norm
[       OK ] blas_kernels.l2norm (13 ms)
[ RUN      ] blas_kernels.absolute_sum
[       OK ] blas_kernels.absolute_sum (0 ms)
[ RUN      ] blas_kernels.rmsnorm_fp32
FP32 RMSNORM : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 2747898 us
 - time : GPU = 873 us
[       OK ] blas_kernels.rmsnorm_fp32 (2957 ms)
[ RUN      ] blas_kernels.rmsnorm_fp32_svm
FP32 RMSNORM : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 2702493 us
 - time : GPU = 156278 us
[       OK ] blas_kernels.rmsnorm_fp32_svm (3022 ms)
[ RUN      ] blas_kernels.swiglu_layer_fp32
FP32 SWIGLU : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 296745 us
 - time : GPU = 147 us
[       OK ] blas_kernels.swiglu_layer_fp32 (570 ms)
[ RUN      ] blas_kernels.swiglu_layer_fp32_svm
FP32 SWIGLU : N: 1 C: 1 H: 3072 W: 3072
 - time : CPU = 419297 us
 - time : GPU = 35503 us
[       OK ] blas_kernels.swiglu_layer_fp32_svm (672 ms)
[----------] 19 tests from blas_kernels (9314 ms total)

[----------] Global test environment tear-down
[==========] 19 tests from 1 test suite ran. (9314 ms total)
[  PASSED  ] 19 tests.

set_result &= kernel->SetKernelArguments(0, &bufferInA, sizeof(cl_mem));
set_result &= kernel->SetKernelArguments(1, &bufferInB, sizeof(cl_mem));
set_result &= kernel->SetKernelArguments(2, &bufferOutA, sizeof(cl_mem));
if (!set_result) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing ml_loge message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should change SetKernelArguments to throw in case of error instead of return bool in some next PR. This way we will avoid need of checking ret value all the time


if (!cl_context->command_queue_inst_.enqueueKernel(
kernel->GetKernel(), global_work_size.size(), global_work_size.data(),
nullptr, 0, nullptr, &swiglu_wait)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing return + message

if (!use_svm) {
auto &clbuffInstance = ClBufferManager::Global();
if (!clbuffInstance.getOutBufferA()->ReadDataRegion(
cl_context->command_queue_inst_, dim * sizeof(float), vecYdata)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read data region already log error messages

if (!result) {
return;
}
if (use_svm) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from here on - message if failed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should change SetKernelSVMArguments to throw in case of error instead of return bool in some next PR. This way we will avoid need of checking ret value all the time

if (!result) {
return;
}

result = clbuffInstance.getOutBufferA()->ReadDataRegion(
blas_cc->command_queue_inst_, dim2_size, res);
blas_cc->command_queue_inst_.waitForEvent(1, &addition_wait);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release event

@gkisalapl gkisalapl force-pushed the add_cl branch 8 times, most recently from 56e8460 to d110098 Compare August 26, 2025 06:39
Comment on lines +215 to +218
typedef cl_int(CL_API_CALL *PFN_clWaitForEvents)(cl_uint num_events,
const cl_event *event_list);

typedef cl_int(CL_API_CALL *PFN_clReleaseEvent)(cl_event);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous declarations have parameter names commented out. These 2 - one has type + name and second has only type for parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added only clReleaseEvent function and I would suggest to stick with types only in the future

Unify opencl addition for FP16/32
Add possibility to pass nullptr as local work size
Add SVM option

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Grzegorz Kisala <[email protected]>
Add flag to indicate that memory pool data was created
using SVM

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Grzegorz Kisala <[email protected]>
Modify swiglu and rmsnorm cl implementations to work with both
svm allocated tensors and buffers

**Self-evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test:   [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Grzegorz Kisala <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants