
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe #20167


Merged
merged 7 commits into vllm-project:main from minosfuture:fix_maverick_correctness on Jul 8, 2025

Conversation

@minosfuture (Contributor) commented Jun 27, 2025

Purpose

#19667 changed the workspace creation from torch.zeros to torch.empty. This ends up causing correctness issues for models using cutlass_moe, e.g. Maverick in our test case. This PR fixes the correctness issue by explicitly zero-filling the workspace in cutlass_moe.
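For context, here is a minimal, self-contained sketch of the failure mode and the fix (the names and shapes are illustrative only, not the vLLM code): torch.empty returns uninitialized memory, so a buffer that is only partially written and later read in full can leak garbage into the result unless it is zero-filled first.

```python
import torch

M, N = 4, 8
# torch.empty allocates without initializing, so the buffer may contain garbage.
workspace13 = torch.empty(M * N)

# Roughly what happens when the flat workspace is reinterpreted as the
# activation buffer c1 (shapes here are made up for illustration).
c1 = workspace13[: M * N].view(M, N)

# The fix: clear the reused cache space up front, so rows the kernel never
# writes cannot feed garbage into later stages (e.g. dynamic quantization).
c1.fill_(0)
assert torch.all(c1 == 0)
```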

Test Plan

  • lm_eval
  • added a unit test that would fail without this fix (a minimal sketch of the idea is shown below)
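For reference, a minimal sketch of the idea the new unit test exercises (toy_kernel below is a stand-in, not run_cutlass_moe_fp8): a kernel that writes only part of its workspace but later reads all of it produces results that depend on the workspace's initial contents, unless it zero-fills first.

```python
import torch

def toy_kernel(x: torch.Tensor, workspace: torch.Tensor, zero_fill: bool) -> torch.Tensor:
    if zero_fill:
        workspace.fill_(0)          # analogous to the c1.fill_(0) fix
    workspace[: x.numel()] = x      # the kernel writes only part of the buffer...
    return workspace.sum()          # ...but a later stage reads all of it

for zero_fill in (False, True):
    outs = []
    for seed in (0, 1):
        torch.manual_seed(seed)
        ws = torch.empty(8).uniform_(-1, 1)  # simulate torch.empty garbage
        outs.append(toy_kernel(torch.ones(4), ws, zero_fill))
    # Without the zero-fill the two runs disagree; with it they match.
    print(zero_fill, torch.allclose(outs[0], outs[1]))
```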

Test Result

lm_eval results:

local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8081/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.935 | ± 0.0175 |
| | | strict-match | 5 | exact_match | 0.920 | ± 0.0192 |

unit test stability verified:

  • without c1.fill_(0), the following one-liner verifies stable failure:
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v 2>&1 > /dev/null && { echo "shouldn't succeed"; exit 1; }; done
  • with c1.fill_(0), the following verifies stable success:
for i in {1..10}; do echo $i; pytest -s tests/kernels/moe/test_cutlass_moe.py -k "test_run_cutlass_moe_fp8 or test_cutlass_moe_8_bit_EP_large" -v 2>&1 > /dev/null || { echo "should succeed"; exit 1; }; done



👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot, Contributor) left a comment


Summary of Changes

Hello @minosfuture, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical correctness bug in the cutlass_moe kernel, which was caused by uninitialized memory in the workspace after a recent change from torch.zeros to torch.empty. My fix ensures that the relevant workspace is explicitly zero-filled, restoring correct behavior for models utilizing this kernel, such as Maverick. I have also added comprehensive unit tests to prevent future regressions and expand test coverage for MoE operations.

Highlights

  • Bug Fix: MoE Correctness: I have addressed a correctness issue in cutlass_moe by explicitly filling the c1 workspace (derived from workspace13) with zeros. This resolves a problem where torch.empty was used for workspace creation, leading to uninitialized memory causing incorrect results for models like Maverick.
  • New Unit Test for Workspace Initialization: I have added a new unit test, test_run_cutlass_moe_fp8, which specifically verifies that the run_cutlass_moe_fp8 function produces identical results whether its internal workspace is initialized with random values or explicitly filled with zeros. This test directly validates the bug fix and ensures future regressions are caught.
  • Expanded Test Coverage: I have introduced test_cutlass_moe_8_bit_EP_large, a new parameterized test case for larger 8-bit MoE configurations, to further enhance the test suite's coverage and robustness.
  • Minor API Correction: I have corrected the argument order for the activation_callable lambda function within the CutlassMoe.apply method to ensure proper function signature matching.

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request fixes a correctness issue in cutlass_moe by zero-filling the cache space. The added unit test validates the fix. I have a suggestion to improve the efficiency of the new unit test.

@yeqcharlotte (Collaborator) left a comment


thanks for the unit test to repro the issue!

cc: @bnellnm to also take a look!

@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Collaborator

sounds like this should impact both the chunking and non-chunking paths?

Contributor Author

I think the non-chunking (batched) path is not impacted because c1 is fully overwritten. I don't have solid proof though (I need to look into that code path more). @bnellnm / @ElizaWszola, comments?

Contributor

I tested this locally and found that the batched case needs to be cleared also. I think it's probably best to unconditionally zero out c1

Contributor Author

I see. updated. Could you share how to run batched case tests? thx.

Contributor

I ran test_cutlass_moe.py and test_pplx_cutlass_moe.py

Contributor

I think the condition should be expert_map is not None or self.use_batched_format. Batched mode is almost always going to have some garbage space in the tensor.
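A sketch of the guard being suggested, wrapped in a hypothetical helper for illustration (maybe_zero_fill is not a real vLLM function; the condition follows this thread and may not match the exact code that landed):

```python
def maybe_zero_fill(c1, expert_map, use_batched_format: bool) -> None:
    # Only pay for the zero-fill when c1 may be partially written: with an
    # expert_map (fewer local than global tokens) or in batched format, where
    # parts of the buffer are never overwritten.
    if expert_map is not None or use_batched_format:
        c1.fill_(0)
```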

Collaborator

Is the failure caused by reading in garbage data for dynamic per-tensor quantization?

Does it work in the static per-tensor case?

Contributor Author

yep, exactly. And it works, without zero-out, in the static scale case.

@@ -365,3 +369,131 @@ def test_cutlass_moe_8_bit_EP(
         cutlass_output,
         atol=5e-2,
         rtol=1e-2)
+
+
+@pytest.mark.parametrize("m,n,k,topk", [(1, 8192, 5120, 31)])
Collaborator

let's have some m > 32k

Contributor Author

added

from vllm.model_executor.layers.fused_moe.fused_moe import (fused_experts,
                                                            fused_topk)
from vllm.model_executor.layers.fused_moe.utils import (
    moe_kernel_quantize_input)
from vllm.platforms import current_platform

NUM_EXPERTS = [40, 64]
Collaborator

does it help with the working sets at line 38-39?

Contributor Author

unfortunately no. We can look into this more separately.

@@ -176,6 +176,7 @@ def run_cutlass_moe_fp8(
     c1 = _resize_cache(workspace13, (M * topk, N * 2))
     c2 = _resize_cache(workspace2, (M * topk, N))
     c3 = _resize_cache(workspace13, (M * topk, K))
+    c1.fill_(0)
Contributor

Is this needed when we don't use expert_map? In case it's not, can you write a condition for this?

Contributor Author

thanks. updated!

Collaborator

Why does the expert map matter here?

Contributor Author

Only when expert_map is in use will c1 not be fully overwritten. Note that c1 is resized to (# tokens globally, 2N), while the space actually used is (# tokens locally, 2N). When expert_map is in use, the local token count can be smaller.
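To make the shape argument concrete, a small sketch with made-up sizes (the numbers are illustrative, not from the PR):

```python
import torch

global_rows = 8   # c1 is sized for (# tokens globally, 2N)
local_rows = 5    # with an expert_map, only locally routed tokens are written
two_n = 4

c1 = torch.empty(global_rows, two_n)              # may hold torch.empty garbage
c1.fill_(0)                                       # the fix: clear before the partial write
c1[:local_rows] = torch.randn(local_rows, two_n)  # only local-token rows get written

# The trailing rows are now guaranteed to be zero instead of garbage, so a later
# stage that scans the whole buffer (e.g. dynamic per-tensor quantization) stays correct.
assert torch.all(c1[local_rows:] == 0)
```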

    per_out_channel: bool,
    ep_size: int,
):
    current_platform.seed_everything(7)
Contributor

If the body of this is the same as test_cutlass_moe_8_bit_EP can you factor it out into a common function?

Contributor Author

Good point! Thanks. Updated; I refactored a few similar test functions here.

):
    current_platform.seed_everything(7)
    monkeypatch.setenv("VLLM_FUSED_MOE_CHUNK_SIZE", "8192")
Contributor Author

this is not very git-friendly, but note this line is not removed during refactoring. see test_cutlass_moe_8_bit_no_graph


mergify bot commented Jul 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @minosfuture.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 3, 2025
@tlrmchlsmth (Collaborator) left a comment

A couple of questions but good to land once rebased. Thanks for the fix


@minosfuture minosfuture force-pushed the fix_maverick_correctness branch from 85c201e to b2c0bef on July 3, 2025 17:50
@mergify mergify bot removed the needs-rebase label Jul 3, 2025
@minosfuture minosfuture requested a review from bnellnm July 3, 2025 17:50
@minosfuture (Contributor Author) commented

@tlrmchlsmth this should be good to go. Could you help trigger CI/auto-merge? thx!

@minosfuture minosfuture force-pushed the fix_maverick_correctness branch from b2c0bef to 6c20e94 on July 3, 2025 22:07
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 7, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) July 7, 2025 19:40
@tlrmchlsmth tlrmchlsmth merged commit afb7cff into vllm-project:main Jul 8, 2025
77 checks passed
huydhn pushed a commit to huydhn/vllm that referenced this pull request Jul 8, 2025
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed
5 participants