
Conversation

Collaborator

@yongwww yongwww commented Aug 30, 2025

📌 Description

This depends on #1608, mainly the CUTLASS fp8 GEMM support for sm120/121; will rebase after #1608 lands.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

self,
inputs: List[torch.Tensor],
tactic: int = -1,
do_preparation: bool = False,
Contributor


seems like this parameter is unused?

Collaborator Author


Good catch. They're part of the TunableRunner interface; keeping them for consistency with the other runners.
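The interface-consistency point above can be sketched roughly as follows. This is a minimal, hypothetical illustration (class and method names are assumed, not the actual flashinfer TunableRunner API): every concrete runner keeps the full base signature even when a parameter like `do_preparation` is unused by that backend.

```python
from typing import List


class TunableRunner:
    """Hypothetical sketch of a tunable-runner base interface;
    the real flashinfer TunableRunner may differ."""

    def forward(self, inputs: List[float], tactic: int = -1,
                do_preparation: bool = False) -> List[float]:
        raise NotImplementedError


class ScaleRunner(TunableRunner):
    # Keep the full base signature for consistency with sibling runners,
    # even though do_preparation is unused by this particular backend.
    def forward(self, inputs: List[float], tactic: int = -1,
                do_preparation: bool = False) -> List[float]:
        del do_preparation  # accepted but intentionally ignored here
        scale = 2.0 if tactic < 0 else float(tactic)
        return [x * scale for x in inputs]
```

Keeping the signature uniform lets an autotuner drive every runner through the same call site without special-casing which backend consumes which argument.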

@yongwww yongwww force-pushed the sm120_cutlass_fp8_gemm branch from 296583e to c172dfd Compare September 3, 2025 00:17
@yongwww yongwww marked this pull request as ready for review September 3, 2025 00:29
Collaborator

aleozlx commented Sep 3, 2025

Looks good, no further comments from me.

constexpr int SCALE_GRANULARITY_M = 1; /* Always 1 for SM120 */ \
constexpr int SCALE_GRANULARITY_K = 128; /* Always 128 for SM120 per CUTLASS requirement */ \
if (scale_granularity_m != 1) { \
TORCH_CHECK(false, "SM120 only supports scale_granularity_m=1"); \
Collaborator

@yzh119 yzh119 Sep 3, 2025


Collaborator Author


It will run into a static assertion failure: "Scale Granularity M must evenly divide the tile shape M."

Collaborator

@yzh119 yzh119 Sep 4, 2025


Collaborator Author

@yongwww yongwww Sep 4, 2025


Right. I used a standalone test (not in this PR) to trigger that error message. I'll go with the #1610 (comment).
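The constraint behind that static assertion can be mirrored host-side in a short sketch. The function name and the call site below are hypothetical (the real check lives in CUTLASS C++ as a `static_assert`); the tile sizes are illustrative values, not the kernel's actual configuration.

```python
def check_scale_granularity(scale_granularity_m: int, tile_m: int) -> None:
    """Hypothetical host-side mirror of the CUTLASS static assertion
    'Scale Granularity M must evenly divide the tile shape M.'"""
    if tile_m % scale_granularity_m != 0:
        raise ValueError(
            f"scale_granularity_m={scale_granularity_m} must evenly "
            f"divide tile M={tile_m}")


# scale_granularity_m=1 divides every tile shape, which is why it is
# the universally safe choice on SM120 regardless of the selected tile.
for tile_m in (64, 128, 256):
    check_scale_granularity(1, tile_m)
```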

Collaborator

yzh119 commented Sep 4, 2025

There might be some misunderstanding of MmaSM here: we use it in sm100 GEMM because sm100 supports tcgen05 and 2-CTA mode (where two CTAs cooperatively perform an MMA computation).

However, sm120 does not have tcgen05 or 2-CTA MMA, so MmaSM doesn't make sense here; it should always be 1 on sm120.

The condition used to separate the two cases in https://github.com/flashinfer-ai/flashinfer/pull/1610/files#diff-0977093a8d2429e66dab4cc40f31563717098cb5aca4354a814e4208f58f068bR78 should not be MmaSM == 1, but the Cooperative/PingPong schedule: https://github.com/NVIDIA/cutlass/blob/b2dd65dc864e09688245b316ac46c4a6cd07e15c/examples/87_blackwell_geforce_gemm_blockwise/87b_blackwell_geforce_fp8_bf16_gemm_groupwise.cu#L120-L123.

Please consider the following changes:

  1. Remove the confusing MmaSM argument; there is no concept of 2SM on sm120.
  2. If we want to support both PingPong and Cooperative GEMM, please refer to https://github.com/NVIDIA/cutlass/blob/b2dd65dc864e09688245b316ac46c4a6cd07e15c/examples/87_blackwell_geforce_gemm_blockwise/87b_blackwell_geforce_fp8_bf16_gemm_groupwise.cu#L166-L171
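The suggested split can be sketched in miniature: branch on the kernel schedule rather than an MmaSM count, since SM120 has no 2-CTA MMA. This is illustrative only; the real dispatch is C++ template selection in CUTLASS, and the schedule names and tile shapes below are placeholders, not CUTLASS's actual values.

```python
# Hypothetical schedule tags; the real ones are CUTLASS C++ types.
COOPERATIVE = "cooperative"
PINGPONG = "pingpong"


def select_cta_tile(schedule: str) -> tuple:
    # On SM120 the case split is by schedule, not by an MmaSM count
    # (which would always be 1 anyway). Tile shapes are placeholders.
    if schedule == COOPERATIVE:
        return (128, 128, 128)
    if schedule == PINGPONG:
        return (64, 128, 128)
    raise ValueError(f"unknown schedule: {schedule!r}")
```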

@yongwww yongwww force-pushed the sm120_cutlass_fp8_gemm branch from 98fdd75 to 3a5cd77 Compare September 4, 2025 02:19
Collaborator Author

yongwww commented Sep 4, 2025

Thanks, @yzh119, @nvmbreughe, @aleozlx for the helpful and insightful comments! I've incorporated them. Please take a look. For the PingPong GEMM, I left it as a TODO for now; the current default in the CUTLASS examples is cooperative GEMM.

constexpr int SCALE_GRANULARITY_K = 128; /* equals the tile K dimension */ \
if (scale_granularity_m != 1) { \
TORCH_CHECK(false, \
"SM120 only supports scale_granularity_m=1 to ensure compatibility with all " \
Collaborator

@yzh119 yzh119 Sep 4, 2025


Is this still the case after your changes? If not, let's add 128 back.

Collaborator Author


Good catch! A divisor of 128 should be valid. I added support for scale_granularity_m=1 and scale_granularity_m=128; the change is: https://github.com/flashinfer-ai/flashinfer/pull/1610/files#diff-68929275a79ec730031c1d5bec894f35ba6e932a08841fdae63087a6937c0f4fR44-R52
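The relaxed validation described above might be sketched like this. The supported set `{1, 128}` follows the discussion in this thread; the function name and error text are hypothetical stand-ins for the PR's actual C++ TORCH_CHECK.

```python
# Assumed supported set, mirroring the discussion in this PR.
SUPPORTED_SCALE_GRANULARITY_M = (1, 128)


def validate_scale_granularity_m(value: int) -> None:
    """Hypothetical host-side check; the PR implements this in C++."""
    if value not in SUPPORTED_SCALE_GRANULARITY_M:
        raise ValueError(
            f"SM120 fp8 GEMM supports scale_granularity_m in "
            f"{SUPPORTED_SCALE_GRANULARITY_M}, got {value}")
```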

@yongwww yongwww force-pushed the sm120_cutlass_fp8_gemm branch from b0ed858 to 5f40472 Compare September 4, 2025 16:17
@yongwww yongwww force-pushed the sm120_cutlass_fp8_gemm branch from 3c22b87 to 305be25 Compare September 4, 2025 17:04
@yongwww yongwww merged commit 90abf04 into flashinfer-ai:main Sep 4, 2025
2 checks passed
@yongwww yongwww deleted the sm120_cutlass_fp8_gemm branch September 4, 2025 20:47