[Kernel]Add streamK for block-quantized CUTLASS kernels #12978
Conversation
leoneo force-pushed from dc96d3a to 1c3b5ea (Signed-off-by: leoneo <[email protected]>)
leoneo force-pushed from 1c3b5ea to 68f18b4 (Signed-off-by: leoneo <[email protected]>)
Signed-off-by: leoneo <[email protected]>
Two resolved review comments on csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8_dispatch.cuh (now outdated).
@Hongbosherlock Awesome, thanks for the contribution! That's some nice speedups! Left a couple of nits.
Thank you for the feedback! I’ll take a look at the nits and address them ASAP.
Hi @LucasWilkinson |
leoneo force-pushed from 11fbfed to b986c01 (Signed-off-by: leoneo <[email protected]>)
LGTM, thanks for the hard work!
Thanks to #11868.

Currently this only supports `scale_a` block shapes of 1x128 and `scale_b` block shapes of 128x128 (for DeepSeek-V3).

This CUTLASS GEMM is slower when K is very large, such as `m128-n1536-k7168` in DeepSeek-V3. To address this, I explored both `Stream-K` and `Split-K` strategies to accelerate the GEMM computation. Of these, `Stream-K` demonstrated superior performance in most cases, achieving over a 60% improvement compared to the baseline. To get the best overall performance, `Stream-K` is used only when `K > 3N`.
.Below are the test results based on the GEMM shape of DeepSeek-V3, tested on H800 with TP-4.
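To give intuition for why Stream-K can beat Split-K here, the following toy model (not the CUTLASS implementation; all names are illustrative) shows the core idea: Stream-K divides the total number of MAC-loop iterations across all output tiles evenly over the available workers (persistent CTAs), so no worker idles in a remainder wave, whereas Split-K splits each tile's K range into a fixed number of pieces regardless of how many tiles there are.

```cpp
#include <cstdint>
#include <vector>

// Half-open iteration range assigned to one worker.
struct Range { int64_t begin, end; };

// Evenly partition `total_iters` loop iterations over `workers`
// persistent CTAs; the first (total_iters % workers) workers get one
// extra iteration so the load imbalance is at most a single iteration.
std::vector<Range> stream_k_partition(int64_t total_iters, int64_t workers) {
  std::vector<Range> out;
  int64_t base = total_iters / workers;
  int64_t rem = total_iters % workers;
  int64_t cur = 0;
  for (int64_t w = 0; w < workers; ++w) {
    int64_t len = base + (w < rem ? 1 : 0);
    out.push_back({cur, cur + len});
    cur += len;
  }
  return out;
}
```

Workers whose range straddles a tile boundary then reduce their partial accumulators, which is the fix-up cost Stream-K pays in exchange for near-perfect occupancy.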