
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe #20167


Merged
merged 7 commits into vllm-project:main from minosfuture:fix_maverick_correctness on Jul 8, 2025

Conversation

@minosfuture (Contributor) commented Jun 27, 2025

Purpose

#19667 changed the workspace creation from torch.zeros to torch.empty. This ends up causing correctness issues for models using cutlass_moe, e.g. Maverick in our test case. This PR fixes the correctness issue by explicitly zero-filling the workspace in cutlass_moe.
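For context, here is a minimal, self-contained sketch of the failure mode and the fix (the names and shapes are illustrative only, not the vLLM code): torch.empty returns uninitialized memory, so a buffer that is only partially written and later read in full can leak garbage into the result unless it is zero-filled first.

```python
import torch

M, N = 4, 8
# torch.empty allocates without initializing, so the buffer may contain garbage.
workspace13 = torch.empty(M * N)

# Roughly what happens when the flat workspace is reinterpreted as the
# activation buffer c1 (shapes here are made up for illustration).
c1 = workspace13[: M * N].view(M, N)

# The fix: clear the reused cache space up front, so rows the kernel never
# writes cannot feed garbage into later stages (e.g. dynamic quantization).
c1.fill_(0)
assert torch.all(c1 == 0)
```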

Test Plan

  • lm_eval
  • added a unit test that would fail without this fix (a minimal sketch of the idea is shown below)
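For reference, a minimal sketch of the idea the new unit test exercises (toy_kernel below is a stand-in, not run_cutlass_moe_fp8): a kernel that writes only part of its workspace but later reads all of it produces results that depend on the workspace's initial contents, unless it zero-fills first.

```python
import torch

def toy_kernel(x: torch.Tensor, workspace: torch.Tensor, zero_fill: bool) -> torch.Tensor:
    if zero_fill:
        workspace.fill_(0)          # analogous to the c1.fill_(0) fix
    workspace[: x.numel()] = x      # the kernel writes only part of the buffer...
    return workspace.sum()          # ...but a later stage reads all of it

for zero_fill in (False, True):
    outs = []
    for seed in (0, 1):
        torch.manual_seed(seed)
        ws = torch.empty(8).uniform_(-1, 1)  # simulate torch.empty garbage
        outs.append(toy_kernel(torch.ones(4), ws, zero_fill))
    # Without the zero-fill the two runs disagree; with it they match.
    print(zero_fill, torch.allclose(outs[0], outs[1]))
```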

Test Result

lm_eval results:

local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8081/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.935 | ± 0.0175 |
| | | strict-match | 5 | exact_match | 0.920 | ± 0.0192 |

unit test stability verified:

  • without c1.fill_(0), the following one-liner verifies stable failure:
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v 2>&1 > /dev/null && { echo "shouldn't succeed"; exit 1; }; done
  • with c1.fill_(0), the following verifies stable success:
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v 2>&1 > /dev/null || { echo "should succeed"; exit 1; }; done



👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot, Contributor) left a comment


Summary of Changes

Hello @minosfuture, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical correctness bug in the cutlass_moe kernel, which was caused by uninitialized memory in the workspace after a recent change from torch.zeros to torch.empty. My fix ensures that the relevant workspace is explicitly zero-filled, restoring correct behavior for models utilizing this kernel, such as Maverick. I have also added comprehensive unit tests to prevent future regressions and expand test coverage for MoE operations.

Highlights

  • Bug Fix: MoE Correctness: I have addressed a correctness issue in cutlass_moe by explicitly filling the c1 workspace (derived from workspace13) with zeros. This resolves a problem where torch.empty was used for workspace creation, leading to uninitialized memory causing incorrect results for models like Maverick.
  • New Unit Test for Workspace Initialization: I have added a new unit test, test_run_cutlass_moe_fp8, which specifically verifies that the run_cutlass_moe_fp8 function produces identical results whether its internal workspace is initialized with random values or explicitly filled with zeros. This test directly validates the bug fix and ensures future regressions are caught.
  • Expanded Test Coverage: I have introduced test_cutlass_moe_8_bit_EP_large, a new parameterized test case for larger 8-bit MoE configurations, to further enhance the test suite's coverage and robustness.
  • Minor API Correction: I have corrected the argument order for the activation_callable lambda function within the CutlassMoe.apply method to ensure proper function signature matching.

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request fixes a correctness issue in cutlass_moe by zero-filling the cache space. The added unit test validates the fix. I have a suggestion to improve the efficiency of the new unit test.

@yeqcharlotte (Collaborator) left a comment


thanks for the unit test to repro the issue!

cc: @bnellnm to also take a look!

@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Collaborator

sounds like this should impact both the chunking and non-chunking paths?

Contributor Author

I think the non-chunking (batched) path is not impacted because c1 is fully overwritten. I don't have solid proof though (I need to look into that code path more). @bnellnm / @ElizaWszola, comments?

Contributor

I tested this locally and found that the batched case needs to be cleared also. I think it's probably best to unconditionally zero out c1

Contributor Author

I see. updated. Could you share how to run batched case tests? thx.

Contributor

I ran test_cutlass_moe.py and test_pplx_cutlass_moe.py

Contributor

I think the condition should be expert_map is not None or self.use_batched_format. Batched mode is almost always going to have some garbage space in the tensor.
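A sketch of the guard being suggested, wrapped in a hypothetical helper for illustration (maybe_zero_fill is not a real vLLM function; the condition follows this thread and may not match the exact code that landed):

```python
def maybe_zero_fill(c1, expert_map, use_batched_format: bool) -> None:
    # Only pay for the zero-fill when c1 may be partially written: with an
    # expert_map (fewer local than global tokens) or in batched format, where
    # parts of the buffer are never overwritten.
    if expert_map is not None or use_batched_format:
        c1.fill_(0)
```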

Collaborator

Is the failure caused by reading in garbage data for dynamic per-tensor quantization?

Does it work in the static per-tensor case?

Contributor Author

yep, exactly. And it works, without zero-out, in the static scale case.

@@ -365,3 +369,131 @@ def test_cutlass_moe_8_bit_EP(
         cutlass_output,
         atol=5e-2,
         rtol=1e-2)
+
+
+@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
Collaborator

let's have some m > 32k

Contributor Author

added

from vllm.model_executor.layers.fused_moe.fused_moe import (fused_experts,
                                                            fused_topk)
from vllm.model_executor.layers.fused_moe.utils import (
    moe_kernel_quantize_input)
from vllm.platforms import current_platform

NUM_EXPERTS = [40, 64]
Collaborator

does it help with the working sets at line 38-39?

Contributor Author

unfortunately no. We can look into this more separately.

@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Contributor

Is this needed when we don't use expert_map? In case it's not, can you write a condition for this?

Contributor Author

thanks. updated!

Collaborator

Why does the expert map matter here?

Contributor Author

Only when expert_map is in use will c1 not be fully overwritten. Note that c1 is resized to (# tokens globally, 2N), while the space actually used is (# tokens locally, 2N). When expert_map is in use, the local token count can be smaller.
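To make the shape argument concrete, a small sketch with made-up sizes (the numbers are illustrative, not from the PR):

```python
import torch

global_rows = 8   # c1 is sized for (# tokens globally, 2N)
local_rows = 5    # with an expert_map, only locally routed tokens are written
two_n = 4

c1 = torch.empty(global_rows, two_n)              # may hold torch.empty garbage
c1.fill_(0)                                       # the fix: clear before the partial write
c1[:local_rows] = torch.randn(local_rows, two_n)  # only local-token rows get written

# The trailing rows are now guaranteed to be zero instead of garbage, so a later
# stage that scans the whole buffer (e.g. dynamic per-tensor quantization) stays correct.
assert torch.all(c1[local_rows:] == 0)
```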

    per_out_channel: bool,
    ep_size: int,
):
    current_platform.seed_everything(7)
Contributor

If the body of this is the same as test_cutlass_moe_8_bit_EP can you factor it out into a common function?

Contributor Author

Good point! Thanks. Updated; I refactored a few similar test functions here.

):
    current_platform.seed_everything(7)
    monkeypatch.setenv("VLLM_FUSED_MOE_CHUNK_SIZE", "8192")
Contributor Author

this is not very git-friendly, but note this line is not removed during refactoring. see test_cutlass_moe_8_bit_no_graph


mergify bot commented Jul 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @minosfuture.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 3, 2025
@tlrmchlsmth (Collaborator) left a comment

A couple of questions but good to land once rebased. Thanks for the fix


@minosfuture minosfuture force-pushed the fix_maverick_correctness branch from 85c201e to b2c0bef on July 3, 2025 17:50
@mergify mergify bot removed the needs-rebase label Jul 3, 2025
@minosfuture minosfuture requested a review from bnellnm July 3, 2025 17:50
@minosfuture (Contributor Author) commented

@tlrmchlsmth this should be good to go. Could you help trigger CI/auto-merge? thx!

@minosfuture minosfuture force-pushed the fix_maverick_correctness branch from b2c0bef to 6c20e94 on July 3, 2025 22:07
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 7, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) July 7, 2025 19:40
@tlrmchlsmth tlrmchlsmth merged commit afb7cff into vllm-project:main Jul 8, 2025
77 checks passed
huydhn pushed a commit to huydhn/vllm that referenced this pull request Jul 8, 2025
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed
5 participants