Commit 15c44a6
Adding dynamic thread-setting logic for CGEMM(AOCL_DYNAMIC) (#48)
- Added a set of thresholds(based on input dimensions) that
determine and set the ideal number of threads to be used
for CGEMM (on ZEN4 and ZEN5 architectures).
- The thread-setting logic is as follows :
- The underlying kernels(single-threaded) work on blocks
of MRxk of A, kxNR of B and MRxNR of C. Thus, it is
initially assumed that the optimal number of threads is
ceil(m/MR)*ceil(n/NR). This is the upper bound on the
actual number of threads that is ideal.
- The actual ideal thread count could be lesser than the
upper bound, based on the work that every thread receives.
This is mainly determined by the value of 'k'.
- If 'k' is small, the arithmetic intensity(AI) is low and
memory bandwidth becomes the limiting factor, thus favoring
smaller thread counts. In contrast, if 'k' is high, the AI
is high and the workload scales well with higher thread counts.
- So, we limit the number of threads when 'k' is small to avoid
bandwidth contention. Using fewer threads ensures each thread
gets more bandwidth, improving efficiency. In contrast, we allow
more threads when 'k' is large, as the computation becomes more
compute-bound and less limited by memory bandwidth, thereby benefitting
with a higher-thread count.
- The new logic will now set the upper bound for the optimal number of threads
(based on the number of tiles), and then further reduce it based on the values
of 'm', 'n' and 'k'. This comes under the 'AOCL_DYNAMIC' feature for CGEMM,
specifically for ZEN4 and ZEN5 architectures.
AMD-Internal: [CPUPL-6498]
Co-authored-by: Vignesh Balasubramanian <[email protected]>
Co-authored-by: Varaganti, Kiran <[email protected]>1 parent c81408c commit 15c44a6
1 file changed
+686
-0
lines changed
0 commit comments