[MIOpen] Check numeric increase grid size to hide memory latencies #2341
Conversation
2067a63 to e4144a6 (Compare)
Very nice. Would you mind repeating the comparison with one of the very small shapes to check for any slowdown from all the unneeded warps?
          
The new revision contains an adaptive grid size calculation based on the amount of work and hardware occupancy.

For a small tensor (3×3×512×512 = 2.3M elements): uses the work-based value (2,304 blocks), since it already exceeds the minimum occupancy.

For a VERY small tensor (10K elements): overrides the work-based value to enforce a minimum of 104 blocks and keep all CUs busy.

The `std::max` acts as a floor: without it, a 10K element tensor would only launch 10 blocks, leaving 94 of 104 CUs completely idle. `std::min(..., maxBlocksForOccupancy)` then caps the value at the ceiling. Together they give numBlocks ∈ [104, 6,656], with the work-based value preferred in between.
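A minimal sketch of that selection logic (illustrative only, not the actual MIOpen source; the constants mirror the MI210 numbers quoted above, and the helper name and default values are assumptions):

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch of the adaptive grid-size selection described above.
// Assumed defaults: MI210 with 104 CUs, 256-wide workgroups, 4 elements
// processed per thread, and a cap of 64 blocks per CU.
std::size_t ChooseNumBlocks(std::size_t numElements,
                            std::size_t numCUs         = 104,
                            std::size_t blockSize      = 256,
                            std::size_t elemsPerThread = 4,
                            std::size_t maxBlocksPerCU = 64)
{
    // Blocks needed to cover the work, each thread handling a few elements.
    const std::size_t minBlocks = numElements / elemsPerThread / blockSize;

    // Floor: at least one block per CU so tiny tensors do not leave CUs idle.
    const std::size_t minBlocksForOccupancy = numCUs;

    // Ceiling: cap the launch so huge tensors do not over-subscribe the GPU.
    const std::size_t maxBlocksForOccupancy = numCUs * maxBlocksPerCU;

    return std::min(std::max(minBlocks, minBlocksForOccupancy),
                    maxBlocksForOccupancy);
}

// ChooseNumBlocks(3 * 3 * 512 * 512) == 2304  (work-based value wins)
// ChooseNumBlocks(10'000)            == 104   (occupancy floor wins)
```

With `maxBlocksPerCU = 64` this reproduces the [104, 6,656] range from the comment above; the PR description quotes a larger cap of 53,248, which corresponds to 512 blocks per CU.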
    
This looks very good: 212 ms to 10 ms is amazing. How long does the convolution take? Just wondering how this time compares relatively?
    
An example convolution with a large tensor takes ~2 ms to run, and on the latest commit check numerics takes 1.3-1.8 ms, compared to 210-280 ms originally.
    
LGTM. Thanks!
ef9cf15 to 3e3e0ca (Compare)
3e3e0ca to 48de74e (Compare)
[MIOpen] Check numeric increase grid size to hide memory latencies (#2341)

## Motivation
For large tensor dimensions there were significant SQ_WAIT_ANY cycles due to the small grid size. rocprof showed that roughly 75-85% of cycles were wait cycles.

## Technical Details
Take a convolution with `./bin/MIOpenDriver conv -n 3 -c 256 -H 512 -W 512`:

1. Current: `CUs * 6` = 624 blocks

   Grid size disparity for the conv vs. check numerics kernel:
   - check_numerics: 624 workgroups = 159,744 threads
   - igemm_wrw: 105,984 workgroups = 27,131,904 threads
   - **170x** difference in parallelism

```
MIOpenDriver conv -n 3 -c 256 -H 512 -W 512 -k 192 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -i 1 -V 0
ROCProfilerV2: Collecting the following counters:
- SQ_CYCLES
Enabling Counter Collection
PRNG seed: 12345678
Timestamp: 2025-10-28 22:17:26 UTC; Host Name: 44cf089bd69b; Operating System: Linux 6.5.0-15-generic; ROCm: 6.4.43484; MIOpen Driver: 3.5.1; CPU Vendor: AMD; CPU Model: 2 x EPYC 7513; RAM Size: 503 GB; GPU Model: 4 x AMD Instinct MI210; AMDGPU Driver: 6.10.10
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.060160 ms (average)
stats: name, n, c, ho, wo, y, x, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 3, 256, 512, 512, 1, 1, 192, 77309411328, 0, 0, 1285063, 0, 0.060160
Dispatch_ID(0), GPU_ID(2), Queue_ID(2), Process_ID(19922), Thread_ID(19922), Grid_Size(26624), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(12), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_fillBufferAligned.kd"), Begin_Timestamp(19804846497731734), End_Timestamp(19804846497755254), Correlation_ID(0), SQ_CYCLES(419824.000000)
Dispatch_ID(1), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849212826632), End_Timestamp(19804849212835752), Correlation_ID(0), SQ_CYCLES(226144.000000)
Dispatch_ID(2), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849214642534), End_Timestamp(19804849428060962), Correlation_ID(0), SQ_CYCLES(2877000960.000000)
Dispatch_ID(3), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849428234880), End_Timestamp(19804849428242720), Correlation_ID(0), SQ_CYCLES(207024.000000)
Dispatch_ID(4), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849428369600), End_Timestamp(19804849428376160), Correlation_ID(0), SQ_CYCLES(182000.000000)
Dispatch_ID(5), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849428465467), End_Timestamp(19804849714608798), Correlation_ID(0), SQ_CYCLES(3857239704.000000)
Dispatch_ID(6), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849714693276), End_Timestamp(19804849714699516), Correlation_ID(0), SQ_CYCLES(184768.000000)
Dispatch_ID(7), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(65536), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("SubTensorOpWithScalar1d.kd"), Begin_Timestamp(19804849716538071), End_Timestamp(19804849716541111), Correlation_ID(0), SQ_CYCLES(140752.000000)
Dispatch_ID(8), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(50331648), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19804849716604631), End_Timestamp(19804849719992301), Correlation_ID(0), SQ_CYCLES(44721312.000000)
Dispatch_ID(9), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(37748736), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19804849720050701), End_Timestamp(19804849722710853), Correlation_ID(0), SQ_CYCLES(35152416.000000)
Dispatch_ID(10), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(105984), Workgroup_Size(256), LDS_Per_Workgroup(16384), Scratch_Per_Workitem(0), Arch_VGPR(56), Accum_VGPR(32), SGPR(80), Wave_Size(64), Kernel_Name("igemm_wrw_gtcx2_nhwc_fp32_bx0_ex0_bt64x128x16_wt32x32x2_ws1x1_wr1x2_ta1x1x1x4_1x16x1x16_tb1x1x1x8_1x16x1x16_gkgs.kd"), Begin_Timestamp(19804849722768293), End_Timestamp(19804849724910367), Correlation_ID(0), SQ_CYCLES(24566288.000000)
Dispatch_ID(11), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849725003486), End_Timestamp(19804849725010846), Correlation_ID(0), SQ_CYCLES(190768.000000)
Dispatch_ID(12), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849725080126), End_Timestamp(19804849725140286), Correlation_ID(0), SQ_CYCLES(881128.000000)
Dispatch_ID(13), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849725197726), End_Timestamp(19804849725205086), Correlation_ID(0), SQ_CYCLES(192744.000000)
```

2. Proposed: Adaptive Grid Size, __numBlocks ∈ [104, 53,248]__

   Occupancy:
   - Waves per CU = (53,248 blocks × 4 waves/block) / 104 CUs = 2,048 waves/CU
   - The scheduler keeps CUs fed to better hide memory latencies and fully utilize available bandwidth

   Work Distribution:
   - __For small tensor (3×3×512×512 = 2.3M elements):__
     1. __minBlocks__ = 2,359,296 / 4 / 256 = __2,304__ (work needed)
     2. __minBlocksForOccupancy__ = (104 × 4) / 4 = __104__ (min for occupancy)
     3. __std::max(2,304, 104) = 2,304__ ✓

     Result: uses the work-based value (2,304), since it already exceeds the minimum occupancy.
   - __For VERY small tensor (10K elements):__
     1. __minBlocks__ = 10,000 / 4 / 256 = __10__ (only 10 blocks needed for work)
     2. __minBlocksForOccupancy__ = __104__
     3. __std::max(10, 104) = 104__ ✓

     Result: overrides the work-based value to ensure a __minimum of 104 blocks__ and keep all CUs busy.
   - Medium tensors (10M elements): ~6 elements/thread
   - Large tensors (200M): ~118 elements/thread

   The `std::max` acts as a __floor function__:
   - If the tensor is large enough → use the work-based value (efficient)
   - If the tensor is too small → enforce minimum occupancy (prevents idle CUs)

   Without this floor, a 10K element tensor would only launch 10 blocks, leaving 94 of 104 CUs completely idle!

   __Then `std::min(..., maxBlocksForOccupancy)` caps at the ceiling__ to prevent over-subscription for huge tensors. Together they create __numBlocks ∈ [104, 53,248]__, with the work-based value preferred in between.

```
MIOpenDriver conv -n 3 -c 256 -H 512 -W 512 -k 192 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -i 1 -V 0
ROCProfilerV2: Collecting the following counters:
- SQ_CYCLES
Enabling Counter Collection
PRNG seed: 12345678
Timestamp: 2025-10-28 23:46:09 UTC; Host Name: 44cf089bd69b; Operating System: Linux 6.5.0-15-generic; ROCm: 6.4.43484; MIOpen Driver: 3.5.1; CPU Vendor: AMD; CPU Model: 2 x EPYC 7513; RAM Size: 503 GB; GPU Model: 4 x AMD Instinct MI210; AMDGPU Driver: 6.10.10
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv.
Elapsed: 0.009760 ms (average) stats: name, n, c, ho, wo, y, x, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: bwdw-conv1x1u1, 3, 256, 512, 512, 1, 1, 192, 77309411328, 0, 0, 7921046, 0, 0.009760 Dispatch_ID(0), GPU_ID(2), Queue_ID(2), Process_ID(21175), Thread_ID(21175), Grid_Size(26624), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(12), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_fillBufferAligned.kd"), Begin_Timestamp(19810169838401376), End_Timestamp(19810169838424736), Correlation_ID(0), SQ_CYCLES(419736.000000) Dispatch_ID(1), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172575993557), End_Timestamp(19810172576002517), Correlation_ID(0), SQ_CYCLES(218648.000000) Dispatch_ID(2), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172577751471), End_Timestamp(19810172588328406), Correlation_ID(0), SQ_CYCLES(142129368.000000) Dispatch_ID(3), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172588405046), End_Timestamp(19810172588412566), Correlation_ID(0), SQ_CYCLES(202024.000000) Dispatch_ID(4), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172588510326), End_Timestamp(19810172588517846), Correlation_ID(0), SQ_CYCLES(194360.000000) Dispatch_ID(5), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172588581046), End_Timestamp(19810172602666772), Correlation_ID(0), SQ_CYCLES(189249576.000000) Dispatch_ID(6), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172602725972), End_Timestamp(19810172602733492), Correlation_ID(0), SQ_CYCLES(202072.000000) Dispatch_ID(7), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(65536), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("SubTensorOpWithScalar1d.kd"), Begin_Timestamp(19810172604346288), End_Timestamp(19810172604349328), Correlation_ID(0), SQ_CYCLES(143176.000000) Dispatch_ID(8), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(50331648), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19810172604408688), 
End_Timestamp(19810172607826280), Correlation_ID(0), SQ_CYCLES(45095960.000000)
Dispatch_ID(9), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(37748736), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19810172607882759), End_Timestamp(19810172610592513), Correlation_ID(0), SQ_CYCLES(35806632.000000)
Dispatch_ID(10), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(105984), Workgroup_Size(256), LDS_Per_Workgroup(16384), Scratch_Per_Workitem(0), Arch_VGPR(56), Accum_VGPR(32), SGPR(80), Wave_Size(64), Kernel_Name("igemm_wrw_gtcx2_nhwc_fp32_bx0_ex0_bt64x128x16_wt32x32x2_ws1x1_wr1x2_ta1x1x1x4_1x16x1x16_tb1x1x1x8_1x16x1x16_gkgs.kd"), Begin_Timestamp(19810172610649793), End_Timestamp(19810172612821788), Correlation_ID(0), SQ_CYCLES(24669752.000000)
Dispatch_ID(11), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172612916507), End_Timestamp(19810172612924027), Correlation_ID(0), SQ_CYCLES(187768.000000)
Dispatch_ID(12), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172612994747), End_Timestamp(19810172613004507), Correlation_ID(0), SQ_CYCLES(222184.000000)
Dispatch_ID(13), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172613059707), End_Timestamp(19810172613067067), Correlation_ID(0), SQ_CYCLES(192920.000000)
```

## Test Plan
Captured rocprof data showing the optimization improved performance by **20×**, reducing execution time from 200 ms to 10 ms.

<img width="749" height="1119" alt="image" src="https://github.com/user-attachments/assets/bcf079ef-10bc-4fca-b301-aa65b133a669" />

## Test Result

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
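Not part of the PR itself, but the headline figures above can be reproduced from the values quoted in the description and the rocprof dumps; a small standalone cross-check (all inputs are copied from the text, nothing here is new data):

```cpp
#include <cstdio>

int main()
{
    // Grid size disparity on the old code path: workgroups * 256 threads each.
    const long checkNumericsThreads = 624L * 256;     // 159,744
    const long igemmWrwThreads      = 105984L * 256;  // 27,131,904
    std::printf("check_numerics threads: %ld\n", checkNumericsThreads);
    std::printf("igemm_wrw threads:      %ld\n", igemmWrwThreads);
    std::printf("parallelism ratio:      ~%.0fx\n",
                (double)igemmWrwThreads / checkNumericsThreads);    // ~170x

    // Proposed upper bound: 53,248 blocks * 4 waves/block spread over 104 CUs.
    std::printf("waves per CU at cap:    %ld\n", 53248L * 4 / 104); // 2,048

    // check_numerics dispatch duration from the rocprof timestamps (ns -> ms).
    std::printf("old: %.1f ms, new: %.1f ms\n",
                (19804849428060962 - 19804849214642534) / 1e6,      // ~213.4 ms
                (19810172588328406 - 19810172577751471) / 1e6);     // ~10.6 ms, ~20x faster
    return 0;
}
```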