
Conversation

yzhou103 (Contributor)

Motivation

Update the A8W8 bpreshuffle ASM code and add it to the tuning flow.

Technical Details

  1. Add ASM A8W8 bpreshuffle int8 codegen.
  2. Add the ASM A8W8 bpreshuffle int8 kernels to gemm_a8w8_bpreshuffle_tune.py.
  3. Refactor gemm_a8w8_bpreshuffle_tune to support int8 tuning; add a q_dtype_w parameter to select between quantization methods.
  4. Update the tuned shapes for ASM A8W8 bpreshuffle int8.
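The q_dtype_w dispatch described in item 3 could look roughly like the sketch below. All names here (candidate_kernels, get_fp8_kernels, get_int8_kernels, the "dtypes.fp8"/"dtypes.i8" strings) are illustrative assumptions, not the actual identifiers in gemm_a8w8_bpreshuffle_tune.py.

```python
# Hypothetical sketch of selecting tuning candidates by weight quantization
# dtype; the real tune script's names and kernel lists will differ.

def get_fp8_kernels():
    # placeholder list standing in for the fp8 kernel instances
    return ["fp8_kernel_a", "fp8_kernel_b"]

def get_int8_kernels():
    # placeholder list standing in for the new int8 ASM kernel instances
    return ["i8_kernel_a", "i8_kernel_b"]

def candidate_kernels(q_dtype_w: str):
    """Return the kernel list to tune for a given weight quantization dtype."""
    dispatch = {
        "dtypes.fp8": get_fp8_kernels,
        "dtypes.i8": get_int8_kernels,
    }
    try:
        return dispatch[q_dtype_w]()
    except KeyError:
        raise ValueError(f"unsupported q_dtype_w: {q_dtype_w}")
```

A table-driven dispatch like this keeps adding a third quantization method to a one-line change.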

Test Plan

python op_tests/test_gemm_a8w8.py
aiter/csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings, October 11, 2025 03:32

Copilot AI left a comment


Pull Request Overview

This PR adds ASM A8W8 bpreshuffle int8 codegen and integrates it into the tuning system. The changes extend the existing A8W8 bpreshuffle GEMM implementation to support int8 quantization alongside fp8, introducing new ASM kernels and updating the tuning framework to handle multiple quantization data types.

  • Adds ASM int8 kernel configuration and codegen for A8W8 bpreshuffle GEMM
  • Refactors tuning framework to support both fp8 and int8 quantization methods via q_dtype_w parameter
  • Updates kernel selection logic and API signatures to support new ASM kernels

Reviewed Changes

Copilot reviewed 14 out of 17 changed files in this pull request and generated 4 comments.

Reviewed files

  • hsa/gfx942/i8gemm/i8gemm_bf16_perTokenI8.csv: new kernel configuration for int8 ASM kernels
  • hsa/gfx942/i8gemm/codegen.py: code generator for ASM i8gemm kernel configurations
  • csrc/py_itfs_cu/asm_gemm_a8w8.cu: major refactoring of the ASM GEMM interface with kernel selection logic
  • csrc/include/rocm_ops.hpp: updated Python binding parameters for the new ASM interface
  • csrc/include/asm_gemm_a8w8.h: updated function signature for the new parameters
  • csrc/ck_gemm_a8w8_bpreshuffle/gen_instances.py: added filtering for the int8 dtype in tuning
  • csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py: major refactoring to support both fp8 and int8 tuning
  • csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.cu: updated to support BFloat16 output
  • csrc/ck_gemm_a8w8_bpreshuffle/README.md: documentation updates for the new q_dtype_w parameter
  • aiter/utility/base_tuner.py: base tuner improvements for result handling
  • aiter/ops/gemm_op_a8w8.py: updated GEMM operations to use the new configuration system
  • aiter/jit/optCompilerConfig.json: added a blob generation command for i8gemm
  • aiter/configs/asm_a8w8_gemm.csv: updated ASM kernel configurations
  • aiter/configs/a8w8_bpreshuffle_untuned_gemm.csv: added the q_dtype_w column and int8 test cases


32768,7168,2304
32768,512,7168
32768,7168,256
M,N,K,,q_dtype_w

Copilot AI Oct 11, 2025


There's an extra comma between K and q_dtype_w in the CSV header.

Suggested change
M,N,K,,q_dtype_w
M,N,K,q_dtype_w

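The extra comma flagged above is not just cosmetic. A minimal stdlib-only illustration (the CSV values are taken from the rows shown; the parsing behavior is standard `csv.DictReader`, not code from this PR):

```python
import csv
import io

# The malformed header "M,N,K,,q_dtype_w" declares an empty fifth column
# name, so a 4-value data row no longer lines up with the intended columns.
bad = io.StringIO("M,N,K,,q_dtype_w\n32768,7168,2304,dtypes.fp8\n")
bad_row = next(csv.DictReader(bad))
# The empty-named column absorbs "dtypes.fp8"; q_dtype_w ends up as None.
print(bad_row["q_dtype_w"])  # None

good = io.StringIO("M,N,K,q_dtype_w\n32768,7168,2304,dtypes.fp8\n")
good_row = next(csv.DictReader(good))
print(good_row["q_dtype_w"])  # dtypes.fp8
```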

16384,7168,2304,dtypes.fp8
16384,512,7168,dtypes.fp8
16384,7168,256,dtypes.fp8
32768,4096,512, dtypes.fp8

Copilot AI Oct 11, 2025


There's an extra space before 'dtypes.fp8' that makes the format inconsistent with other rows.

Suggested change
32768,4096,512, dtypes.fp8
32768,4096,512,dtypes.fp8

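The stray space is a distinct failure mode from the extra-comma case: if the dtype string read from the CSV is compared verbatim (a hypothetical check, sketched here to illustrate the point; the PR's actual comparison code may strip input), the leading space makes the match fail.

```python
# A leading space survives CSV parsing and breaks exact string comparison.
raw = " dtypes.fp8"
print(raw == "dtypes.fp8")          # False
print(raw.strip() == "dtypes.fp8")  # True
```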


if(config_map->empty())
{
TORCH_CHECK(false, __func__, " no kernel support a4w4 for this gpu arch");

Copilot AI Oct 11, 2025


Error message mentions 'a4w4' but should be 'a8w8' since this is the A8W8 GEMM function.

Suggested change
TORCH_CHECK(false, __func__, " no kernel support a4w4 for this gpu arch");
TORCH_CHECK(false, __func__, " no kernel support a8w8 for this gpu arch");


m.def("gemm_a8w8_asm", \
&gemm_a8w8_asm, \
"Asm gemm a8w8 , weight should be shuffle to layout(32,16)", \
"Asm gemm a8w8 , weight should be shuffle to layout(16,16)", \

Copilot AI Oct 11, 2025


Comment incorrectly states layout(16,16) but based on the code changes, it should be layout(32,16) for int8 kernels.

Suggested change
"Asm gemm a8w8 , weight should be shuffle to layout(16,16)", \
"Asm gemm a8w8 , weight should be shuffle to layout(32,16)", \

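For readers unfamiliar with the layout(32,16) wording in the binding docstring, the tiling can be illustrated with a toy reshuffle. This is a sketch only: it assumes a row-major K x N weight and takes the (32, 16) tile dimensions from the comment above; it is not the actual aiter shuffle routine, whose element ordering may differ.

```python
# Illustrative preshuffle: reorder a K x N weight matrix so that each
# (tile_k, tile_n) tile is stored contiguously, tile after tile.
def preshuffle(weight, tile_k=32, tile_n=16):
    K = len(weight)
    N = len(weight[0])
    assert K % tile_k == 0 and N % tile_n == 0, "dims must be tile-aligned"
    out = []
    for k0 in range(0, K, tile_k):          # walk tile rows
        for n0 in range(0, N, tile_n):      # walk tile columns
            for k in range(k0, k0 + tile_k):
                out.extend(weight[k][n0:n0 + tile_n])
    return out  # flat list with tiles laid out back-to-back

# 64x64 toy weight where weight[k][n] == k * 64 + n
w = [[k * 64 + n for n in range(64)] for k in range(64)]
flat = preshuffle(w)
```

Storing each tile contiguously is what lets the ASM kernel load a whole weight tile with sequential reads; the int8 and fp8 kernels apparently differ in which tile shape they expect, which is why the docstring wording matters.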

@yzhou103 yzhou103 changed the title A8w8 asm codegen and tune [Mi35x] A8w8 asm codegen and tune Oct 13, 2025
@yzhou103 yzhou103 changed the title [Mi35x] A8w8 asm codegen and tune A8w8 asm codegen and tune Oct 16, 2025