[Kernel]Add streamK for block-quantized CUTLASS kernels #12978
Conversation
leoneo force-pushed from dc96d3a to 1c3b5ea (Signed-off-by: leoneo <[email protected]>)
leoneo force-pushed from 1c3b5ea to 68f18b4 (Signed-off-by: leoneo <[email protected]>)
Signed-off-by: leoneo <[email protected]>
Two resolved review comments on csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8_dispatch.cuh (now outdated).
@Hongbosherlock Awesome, thanks for the contribution! That's some nice speedups! Left a couple of nits.
Thank you for the feedback! I’ll take a look at the nits and address them ASAP.
Hi @LucasWilkinson |
leoneo force-pushed from 11fbfed to b986c01 (Signed-off-by: leoneo <[email protected]>)
LGTM, thanks for the hard work!
Thanks to #11868.

Currently this only supports `scale_a` block shapes of 1x128 and `scale_b` block shapes of 128x128 (for DeepSeek-V3).

This CUTLASS GEMM is slower when K is very large, such as `m128-n1536-k7168` in DeepSeek-V3. To address this, I explored both `Stream-K` and `Split-K` strategies to accelerate the GEMM computation. Of these, `Stream-K` demonstrated superior performance in most cases, achieving over a 60% improvement compared to the baseline. To get the best overall performance, `Stream-K` is used only when `K > 3N`.
.Below are the test results based on the GEMM shape of DeepSeek-V3, tested on H800 with TP-4.
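To give intuition for why Stream-K can beat Split-K here, the following toy model (not the CUTLASS implementation; all names are illustrative) shows the core idea: Stream-K divides the total number of MAC-loop iterations across all output tiles evenly over the available workers (persistent CTAs), so no worker idles in a remainder wave, whereas Split-K splits each tile's K range into a fixed number of pieces regardless of how many tiles there are.

```cpp
#include <cstdint>
#include <vector>

// Half-open iteration range assigned to one worker.
struct Range { int64_t begin, end; };

// Evenly partition `total_iters` loop iterations over `workers`
// persistent CTAs; the first (total_iters % workers) workers get one
// extra iteration so the load imbalance is at most a single iteration.
std::vector<Range> stream_k_partition(int64_t total_iters, int64_t workers) {
  std::vector<Range> out;
  int64_t base = total_iters / workers;
  int64_t rem = total_iters % workers;
  int64_t cur = 0;
  for (int64_t w = 0; w < workers; ++w) {
    int64_t len = base + (w < rem ? 1 : 0);
    out.push_back({cur, cur + len});
    cur += len;
  }
  return out;
}
```

Workers whose range straddles a tile boundary then reduce their partial accumulators, which is the fix-up cost Stream-K pays in exchange for near-perfect occupancy.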