
Conversation

@AnaghaRaoAMD (Contributor) commented Oct 29, 2025

Motivation

For large tensor dimensions there were significant SQ_WAIT_ANY cycles due to the small grid size; rocprof showed that roughly 75-85% of cycles were wait cycles.

Technical Details

Take convolution with `./bin/MIOpenDriver conv -n 3 -c 256 -H 512 -W 512`

  1. Current: `CUs * 6` = 624 blocks

Grid-size disparity between the conv and check_numerics kernels:

  • check_numerics: 624 workgroups = 159,744 threads
  • igemm_wrw: 105,984 workgroups = 27,131,904 threads
  • 170x difference in parallelism

rocprof capture with the current grid size:

```
MIOpenDriver conv -n 3 -c 256 -H 512 -W 512 -k 192 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -i 1 -V 0
ROCProfilerV2: Collecting the following counters:
- SQ_CYCLES
Enabling Counter Collection
PRNG seed: 12345678
Timestamp: 2025-10-28 22:17:26 UTC; Host Name: 44cf089bd69b; Operating System: Linux 6.5.0-15-generic; ROCm: 6.4.43484; MIOpen Driver: 3.5.1; CPU Vendor: AMD; CPU Model: 2 x EPYC 7513; RAM Size: 503 GB; GPU Model: 4 x AMD Instinct MI210; AMDGPU Driver: 6.10.10
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.060160 ms (average)
stats: name, n, c, ho, wo, y, x, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 3, 256, 512, 512, 1, 1, 192, 77309411328, 0, 0, 1285063, 0, 0.060160
Dispatch_ID(0), GPU_ID(2), Queue_ID(2), Process_ID(19922), Thread_ID(19922), Grid_Size(26624), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(12), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_fillBufferAligned.kd"), Begin_Timestamp(19804846497731734), End_Timestamp(19804846497755254), Correlation_ID(0), SQ_CYCLES(419824.000000)
Dispatch_ID(1), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849212826632), End_Timestamp(19804849212835752), Correlation_ID(0), SQ_CYCLES(226144.000000)
Dispatch_ID(2), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849214642534), End_Timestamp(19804849428060962), Correlation_ID(0), SQ_CYCLES(2877000960.000000)
Dispatch_ID(3), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849428234880), End_Timestamp(19804849428242720), Correlation_ID(0), SQ_CYCLES(207024.000000)
Dispatch_ID(4), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849428369600), End_Timestamp(19804849428376160), Correlation_ID(0), SQ_CYCLES(182000.000000)
Dispatch_ID(5), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849428465467), End_Timestamp(19804849714608798), Correlation_ID(0), SQ_CYCLES(3857239704.000000)
Dispatch_ID(6), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849714693276), End_Timestamp(19804849714699516), Correlation_ID(0), SQ_CYCLES(184768.000000)
Dispatch_ID(7), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(65536), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("SubTensorOpWithScalar1d.kd"), Begin_Timestamp(19804849716538071), End_Timestamp(19804849716541111), Correlation_ID(0), SQ_CYCLES(140752.000000)
Dispatch_ID(8), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(50331648), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19804849716604631), End_Timestamp(19804849719992301), Correlation_ID(0), SQ_CYCLES(44721312.000000)
Dispatch_ID(9), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(37748736), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19804849720050701), End_Timestamp(19804849722710853), Correlation_ID(0), SQ_CYCLES(35152416.000000)
Dispatch_ID(10), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(105984), Workgroup_Size(256), LDS_Per_Workgroup(16384), Scratch_Per_Workitem(0), Arch_VGPR(56), Accum_VGPR(32), SGPR(80), Wave_Size(64), Kernel_Name("igemm_wrw_gtcx2_nhwc_fp32_bx0_ex0_bt64x128x16_wt32x32x2_ws1x1_wr1x2_ta1x1x1x4_1x16x1x16_tb1x1x1x8_1x16x1x16_gkgs.kd"), Begin_Timestamp(19804849722768293), End_Timestamp(19804849724910367), Correlation_ID(0), SQ_CYCLES(24566288.000000)
Dispatch_ID(11), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849725003486), End_Timestamp(19804849725010846), Correlation_ID(0), SQ_CYCLES(190768.000000)
Dispatch_ID(12), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(624), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19804849725080126), End_Timestamp(19804849725140286), Correlation_ID(0), SQ_CYCLES(881128.000000)
Dispatch_ID(13), GPU_ID(2), Queue_ID(1), Process_ID(19922), Thread_ID(19922), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19804849725197726), End_Timestamp(19804849725205086), Correlation_ID(0), SQ_CYCLES(192744.000000)
```

  2. Proposed: Adaptive grid size, numBlocks ∈ [104, 53,248]

Occupancy:

  • Waves per CU = (53,248 blocks × 4 waves/block) / 104 CUs = 2048 waves/CU
  • The scheduler keeps CUs fed to better hide memory latencies and fully utilize available bandwidth

Work Distribution:

  • For a small tensor (3×3×512×512 = 2.3M elements):
  1. minBlocks = 2,359,296 / 4 / 256 = 2,304 (work needed)
  2. minBlocksForOccupancy = (104 × 4) / 4 = 104 (min for occupancy)
  3. std::max(2,304, 104) = 2,304

Result: Uses work-based value (2,304) since it already exceeds minimum occupancy.

For a VERY small tensor (10K elements):

  1. minBlocks = 10,000 / 4 / 256 = 10 (only 10 blocks needed for work)
  2. minBlocksForOccupancy = 104
  3. std::max(10, 104) = 104

Result: Overrides the work-based value to enforce a minimum of 104 blocks, keeping all CUs busy.

  • Medium tensors (10M elements): ~6 elements/thread
  • Large tensors (200M): ~118 elements/thread

The `std::max` acts as a floor function:

  • If tensor is large enough → use work-based value (efficient)
  • If tensor is too small → enforce minimum occupancy (prevents idle CUs)

Without this floor, a 10K element tensor would only launch 10 blocks, leaving 94 of 104 CUs completely idle!

Then `std::min(..., maxBlocksForOccupancy)` caps at the ceiling to prevent over-subscription for huge tensors.

Together they give numBlocks ∈ [104, 53,248], preferring the work-based value in between.
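As a minimal sketch of the clamp described above (the function name AdaptiveNumBlocks, its parameters, and the derivation of the ceiling are illustrative assumptions, not MIOpen's actual identifiers):

```cpp
// Hypothetical sketch of the adaptive grid-size clamp described above;
// names and the ceiling's derivation are assumptions, not MIOpen code.
#include <algorithm>
#include <cstddef>

std::size_t AdaptiveNumBlocks(std::size_t numElements,
                              std::size_t maxBlocksForOccupancy,   // e.g. 53,248
                              std::size_t numCUs            = 104, // MI210 in this run
                              std::size_t blockSize         = 256,
                              std::size_t elementsPerThread = 4)
{
    // Work-based count: ceil(numElements / elementsPerThread / blockSize),
    // e.g. 2,359,296 / 4 / 256 = 2,304.
    const std::size_t perBlock   = elementsPerThread * blockSize;
    const std::size_t workBlocks = (numElements + perBlock - 1) / perBlock;

    // Floor: (104 × 4) / 4 = 104, i.e. at least one block per CU.
    const std::size_t minBlocksForOccupancy = numCUs;

    // std::max enforces the occupancy floor; std::min caps over-subscription.
    return std::min(std::max(workBlocks, minBlocksForOccupancy),
                    maxBlocksForOccupancy);
}
```

Plugging in the shapes above, 2,359,296 elements yields 2,304 blocks (the work-based value wins) and 10,000 elements yields 104 (the floor wins).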

rocprof capture with the adaptive grid size:

```
MIOpenDriver conv -n 3 -c 256 -H 512 -W 512 -k 192 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -i 1 -V 0
ROCProfilerV2: Collecting the following counters:
- SQ_CYCLES
Enabling Counter Collection
PRNG seed: 12345678
Timestamp: 2025-10-28 23:46:09 UTC; Host Name: 44cf089bd69b; Operating System: Linux 6.5.0-15-generic; ROCm: 6.4.43484; MIOpen Driver: 3.5.1; CPU Vendor: AMD; CPU Model: 2 x EPYC 7513; RAM Size: 503 GB; GPU Model: 4 x AMD Instinct MI210; AMDGPU Driver: 6.10.10
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.009760 ms (average)
stats: name, n, c, ho, wo, y, x, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 3, 256, 512, 512, 1, 1, 192, 77309411328, 0, 0, 7921046, 0, 0.009760
Dispatch_ID(0), GPU_ID(2), Queue_ID(2), Process_ID(21175), Thread_ID(21175), Grid_Size(26624), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(12), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_fillBufferAligned.kd"), Begin_Timestamp(19810169838401376), End_Timestamp(19810169838424736), Correlation_ID(0), SQ_CYCLES(419736.000000)
Dispatch_ID(1), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172575993557), End_Timestamp(19810172576002517), Correlation_ID(0), SQ_CYCLES(218648.000000)
Dispatch_ID(2), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172577751471), End_Timestamp(19810172588328406), Correlation_ID(0), SQ_CYCLES(142129368.000000)
Dispatch_ID(3), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172588405046), End_Timestamp(19810172588412566), Correlation_ID(0), SQ_CYCLES(202024.000000)
Dispatch_ID(4), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172588510326), End_Timestamp(19810172588517846), Correlation_ID(0), SQ_CYCLES(194360.000000)
Dispatch_ID(5), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172588581046), End_Timestamp(19810172602666772), Correlation_ID(0), SQ_CYCLES(189249576.000000)
Dispatch_ID(6), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172602725972), End_Timestamp(19810172602733492), Correlation_ID(0), SQ_CYCLES(202072.000000)
Dispatch_ID(7), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(65536), Workgroup_Size(256), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(8), Accum_VGPR(0), SGPR(16), Wave_Size(64), Kernel_Name("SubTensorOpWithScalar1d.kd"), Begin_Timestamp(19810172604346288), End_Timestamp(19810172604349328), Correlation_ID(0), SQ_CYCLES(143176.000000)
Dispatch_ID(8), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(50331648), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19810172604408688), End_Timestamp(19810172607826280), Correlation_ID(0), SQ_CYCLES(45095960.000000)
Dispatch_ID(9), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(37748736), Workgroup_Size(256), LDS_Per_Workgroup(4608), Scratch_Per_Workitem(0), Arch_VGPR(24), Accum_VGPR(0), SGPR(48), Wave_Size(64), Kernel_Name("batched_transpose_32x32_dword.kd"), Begin_Timestamp(19810172607882759), End_Timestamp(19810172610592513), Correlation_ID(0), SQ_CYCLES(35806632.000000)
Dispatch_ID(10), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(105984), Workgroup_Size(256), LDS_Per_Workgroup(16384), Scratch_Per_Workitem(0), Arch_VGPR(56), Accum_VGPR(32), SGPR(80), Wave_Size(64), Kernel_Name("igemm_wrw_gtcx2_nhwc_fp32_bx0_ex0_bt64x128x16_wt32x32x2_ws1x1_wr1x2_ta1x1x1x4_1x16x1x16_tb1x1x1x8_1x16x1x16_gkgs.kd"), Begin_Timestamp(19810172610649793), End_Timestamp(19810172612821788), Correlation_ID(0), SQ_CYCLES(24669752.000000)
Dispatch_ID(11), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172612916507), End_Timestamp(19810172612924027), Correlation_ID(0), SQ_CYCLES(187768.000000)
Dispatch_ID(12), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(6656), Workgroup_Size(256), LDS_Per_Workgroup(4096), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("check_numerics_fp32.kd"), Begin_Timestamp(19810172612994747), End_Timestamp(19810172613004507), Correlation_ID(0), SQ_CYCLES(222184.000000)
Dispatch_ID(13), GPU_ID(2), Queue_ID(1), Process_ID(21175), Thread_ID(21175), Grid_Size(512), Workgroup_Size(512), LDS_Per_Workgroup(0), Scratch_Per_Workitem(0), Arch_VGPR(20), Accum_VGPR(4), SGPR(32), Wave_Size(64), Kernel_Name("__amd_rocclr_copyBuffer.kd"), Begin_Timestamp(19810172613059707), End_Timestamp(19810172613067067), Correlation_ID(0), SQ_CYCLES(192920.000000)
```

Test Plan

Captured rocprof data showing the optimization improved performance by 20×, reducing execution time from 200 ms to 10 ms.

![image](https://github.com/user-attachments/assets/bcf079ef-10bc-4fca-b301-aa65b133a669)

Test Result

Submission Checklist

  • [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

@AnaghaRaoAMD force-pushed the user/anarao/checknumeric branch from 2067a63 to e4144a6 on October 29, 2025 00:45
@AnaghaRaoAMD marked this pull request as ready for review October 29, 2025 00:46
@AnaghaRaoAMD requested a review from a team as a code owner October 29, 2025 00:46

@randyspauldingamd (Contributor) left a comment:

Very nice. Would you mind repeating the comparison with one of the very small shapes to check for any slowdown from all the unneeded warps?

@AnaghaRaoAMD (Contributor, Author) replied:

> Very nice. Would you mind repeating the comparison with one of the very small shapes to check for any slowdown from all the unneeded warps?

The new revision contains an adaptive grid-size calculation based on the amount of work and hardware occupancy.

For a small tensor (3×3×512×512 = 2.3M elements):

  1. minBlocks = 2,359,296 / 4 / 256 = 2,304 (work needed)
  2. minBlocksForOccupancy = (104 × 4) / 4 = 104 (min for occupancy)
  3. std::max(2,304, 104) = 2,304

Result: Uses work-based value (2,304) since it already exceeds minimum occupancy.

For a VERY small tensor (10K elements):

  1. minBlocks = 10,000 / 4 / 256 = 10 (only 10 blocks needed for work)
  2. minBlocksForOccupancy = 104
  3. std::max(10, 104) = 104

Result: Overrides the work-based value to enforce a minimum of 104 blocks, keeping all CUs busy.

The `std::max` acts as a floor function:

  • If tensor is large enough → use work-based value (efficient)
  • If tensor is too small → enforce minimum occupancy (prevents idle CUs)

Without this floor, a 10K element tensor would only launch 10 blocks, leaving 94 of 104 CUs completely idle!

Then `std::min(..., maxBlocksForOccupancy)` caps at the ceiling to prevent over-subscription for huge tensors.

Together they give numBlocks ∈ [104, 6,656], preferring the work-based value in between.
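As a quick sanity check of the revised bounds, here is a hedged usage sketch calling the hypothetical AdaptiveNumBlocks helper sketched earlier in this thread, assuming the 200M-element case is the conv output tensor (3×256×512×512 = 201,326,592 elements):

```cpp
// Sanity checks for the revised [104, 6,656] bounds, using the
// hypothetical AdaptiveNumBlocks sketch from earlier in this thread.
// Expected values are the worked numbers from this discussion.
#include <cassert>

int main()
{
    assert(AdaptiveNumBlocks(2'359'296, 6'656) == 2'304);    // work-based value wins
    assert(AdaptiveNumBlocks(10'000, 6'656) == 104);         // occupancy floor wins
    assert(AdaptiveNumBlocks(201'326'592, 6'656) == 6'656);  // ceiling caps at ~118 elements/thread
    return 0;
}
```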

@JonathanLichtnerAMD (Contributor) commented:

This looks very good: 212 ms to 10 ms is amazing.

How long does the convolution take? Just wondering how this time compares, relatively.

@AnaghaRaoAMD (Contributor, Author) replied:

An example convolution with a large tensor takes ~2 ms to run; on the latest commit, the check-numerics pass takes 1.3-1.8 ms, compared to 210-280 ms originally.

@JonathanLichtnerAMD (Contributor) left a comment:

LGTM. Thanks!

@AnaghaRaoAMD force-pushed the user/anarao/checknumeric branch from ef9cf15 to 3e3e0ca on October 31, 2025 16:58
@AnaghaRaoAMD force-pushed the user/anarao/checknumeric branch from 3e3e0ca to 48de74e on October 31, 2025 17:43
@AnaghaRaoAMD merged commit 3a29102 into develop on October 31, 2025
36 of 57 checks passed
@AnaghaRaoAMD deleted the user/anarao/checknumeric branch October 31, 2025 22:16
assistant-librarian bot pushed a commit to ROCm/MIOpen that referenced this pull request on October 31, 2025:
[MIOpen] Check numeric increase grid size to hide memory latencies (#2341)
