
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution #20137


Merged
2 commits merged into vllm-project:main on Jun 28, 2025

Conversation

@s3woz (Contributor) commented Jun 26, 2025

Purpose

  1. We noticed that GraniteMoeHybrid produces wrong outputs when executed in TensorParallel mode. This bugfix changes the attention implementation to a correct TensorParallel version, which eliminates the output discrepancies; the model now passes tests/models/language/generation/test_hybrid.py::test_distributed_correctness.

  2. Along with this PR, we also:

  • enable the vLLM tests from tests/models/language/generation/test_hybrid.py for GraniteMoeHybrid (they were previously disabled due to missing HF Transformers support, which was added in HF Transformers v4.52.1)

  • remove tests/models/language/generation/test_granitemoehybrid.py, as it duplicated more generic functionality already present in tests/models/language/generation/test_hybrid.py

@tdoublep @bohnstingl

Test Plan

  1. TensorParallel test validating the bugfix:
# NOTE: Temporarily modify test_hybrid.py Line 26 to HYBRID_MODELS=["ibm-granite/granite-4.0-tiny-preview"],
# as only HYBRID_MODELS[0] is used in the test, as explained in Line 12 (see the sketch after the test plan).
CUDA_VISIBLE_DEVICES=0,1 pytest ./tests/models/language/generation/test_hybrid.py::test_distributed_correctness -k granite
  2. Show that the model is included in the standard tests:
pytest ./tests/models/language/generation/test_hybrid.py --collect-only -k granite
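
For reference, the temporary edit mentioned in the note under item 1 would look roughly like this in test_hybrid.py (a local-only sketch; the actual list in the file also contains the other hybrid models):

# tests/models/language/generation/test_hybrid.py -- temporary local edit only.
# test_distributed_correctness exercises HYBRID_MODELS[0], so restrict the
# list to the Granite model under test.
HYBRID_MODELS = ["ibm-granite/granite-4.0-tiny-preview"]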

Test Result

  1. Before the bugfix, test 1 failed; it now succeeds.

  2. Previously no tests were collected for Granite; now two standard tests are collected: test_models and test_batching.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @s3woz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily resolves a critical bug affecting the GraniteMoeHybrid model's output correctness during TensorParallel execution by implementing a proper TensorParallel-aware attention mechanism. Beyond the core bugfix, I've also streamlined the test suite by enabling the model for existing vLLM tests and removing a duplicate test file, ensuring comprehensive and efficient validation.

Highlights

  • Bugfix for TensorParallel Correctness: I've addressed a critical bug in the GraniteMoeHybrid model that caused incorrect outputs when executed in TensorParallel mode. The attention implementation has been refactored to ensure correct behavior under distributed execution, passing the test_distributed_correctness.
  • Attention Layer Refactoring: The individual q_proj, k_proj, and v_proj linear layers within GraniteMoeHybridAttention have been consolidated into a single QKVParallelLinear layer (qkv_proj). Additionally, the o_proj layer now utilizes RowParallelLinear for improved efficiency and correctness in a TensorParallel setup (see the sketch after these highlights).
  • Weight Loading Adaptation: The model's weight loading mechanism has been updated to correctly map and load parameters for the new QKVParallelLinear and RowParallelLinear layers, ensuring compatibility with existing pre-trained weights.
  • Test Suite Enhancement and Cleanup: The GraniteMoeHybrid model has been re-enabled for vLLM tests in test_hybrid.py, leveraging recent HF Transformers support. Concurrently, the redundant test_granitemoehybrid.py file has been removed, as its functionality is now covered by the more generic test_hybrid.py.
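
For readers unfamiliar with the pattern, the attention refactoring described above boils down to the following wiring. This is a minimal sketch, not the exact diff: it assumes vLLM's QKVParallelLinear/RowParallelLinear layers and an initialized tensor-parallel group, and the dimensions are illustrative placeholders rather than the real GraniteMoeHybrid config values.

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)

# Illustrative placeholder dimensions (the real values come from the HF config).
hidden_size, head_dim = 1536, 128
total_num_heads, total_num_kv_heads = 12, 4

# Fused, column-parallel Q/K/V projection replacing the separate q/k/v_proj layers.
qkv_proj = QKVParallelLinear(hidden_size, head_dim, total_num_heads,
                             total_num_kv_heads, bias=False)

# Row-parallel output projection that all-reduces the per-rank partial results.
o_proj = RowParallelLinear(total_num_heads * head_dim, hidden_size, bias=False)

# Per-rank head counts used when splitting the fused qkv output.
tp_size = get_tensor_model_parallel_world_size()
num_heads = total_num_heads // tp_size
num_kv_heads = max(1, total_num_kv_heads // tp_size)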
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

The pull request introduces changes to the GraniteMoeHybrid model to correctly handle TensorParallel execution, including modifications to the attention implementation and updates to relevant tests. The changes involve incorporating QKVParallelLinear and RowParallelLinear, adjusting head dimensions for tensor parallelism, and modifying weight loading logic. The code appears well-structured and addresses the identified bug. The test plan seems adequate, covering both correctness and integration aspects.

Comment on lines +490 to +498
loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
if not loaded:
    _load(n, p)
Severity: medium

The loaded = False variable is initialized but might not be updated in all branches of the else block. If none of the if weight_name in n: conditions are met, loaded will remain False, and _load(n, p) will be called. However, if any of the conditions are met, loaded is set to True, and _load(n, p) is skipped. It's better to ensure that loaded is correctly set in all branches to avoid unexpected behavior. Consider adding a default else clause within the for loop to set loaded = False explicitly.

Suggested change (original):

loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
if not loaded:
    _load(n, p)

Suggested change (proposed):

loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
    else:
        loaded = False  # Add this line
if not loaded:
    _load(n, p)
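
For context, the conventional pattern used across vLLM models for stacked-parameter loading relies on a for/else with break rather than a boolean flag. A minimal sketch follows, assuming params_dict and weights come from the surrounding load_weights() implementation; the PR's _load/_load_shard helpers wrap equivalent logic.

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

stacked_params_mapping = [
    # (param_name, weight_name, shard_id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]

for name, loaded_weight in weights:
    for param_name, weight_name, shard_id in stacked_params_mapping:
        if weight_name not in name:
            continue
        # A stacked mapping matched: route this shard into the fused parameter.
        param = params_dict[name.replace(weight_name, param_name)]
        param.weight_loader(param, loaded_weight, shard_id)
        break
    else:
        # No stacked mapping matched: load the weight as-is.
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)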

@DarkLight1337 (Member) left a comment

Thanks, let's see if the test passes

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 27, 2025 03:56
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 27, 2025
@tdoublep (Member) commented Jun 27, 2025

The test is failing, but it looks like it is the HF output that is wrong:

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test0:
[2025-06-27T05:58:42Z] Matched tokens:	[203]
[2025-06-27T05:58:42Z] hf:	'\n the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the'	{322: -0.06545200943946838, 34: -3.5654520988464355, 30: -3.5654520988464355, 432: -6.0654520988464355, 32: -6.3154520988464355}
[2025-06-27T05:58:42Z] vllm-v0:	'\n### Key Features:\n\n1. **High-Throughput Inference:**\n   - **Multi-GPU Support:** Supports distributed training and inference across multiple GPUs, enabling faster processing of large-scale models.\n   - **Batching:** Efficiently handles multiple inputs simultaneously, maximizing GPU utilization.\n\n2'	{1482: Logprob(logprob=-2.2187652587890625, rank=1, decoded_token='###'), 433: Logprob(logprob=-2.3437652587890625, rank=2, decoded_token='##'), 1318: Logprob(logprob=-2.5937652587890625, rank=3, decoded_token='The'), 705: Logprob(logprob=-2.5937652587890625, rank=4, decoded_token='To'), 10921: Logprob(logprob=-2.9687652587890625, rank=5, decoded_token='Here')}

The tests pass locally for me (on both H100 and L4 GPUs) when I do not have the mamba_ssm and causal_conv1d packages installed. This leads to the following message when running the HF baseline:

The fast path is not available because on of `(selective_state_update, causal_conv1d_fn, causal_conv1d_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d

When I install those two packages in the same way as vLLM CI:

uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/[email protected]"
pip install 'git+https://github.com/Dao-AILab/[email protected]'

Then I get the same failure. It looks like there is an issue with the "fast path" in transformers provided by those two packages.

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.
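
For anyone reproducing this locally, here is a small hedged check of whether the kernels behind the HF "fast path" are importable, which is the condition that triggers the fallback message quoted above; the import paths are the commonly used ones for mamba_ssm and causal_conv1d and may vary across package versions.

def mamba_fast_path_available() -> bool:
    # Returns True only if both fast-path kernel packages import successfully.
    try:
        from causal_conv1d import causal_conv1d_fn, causal_conv1d_update  # noqa: F401
        from mamba_ssm.ops.triton.selective_state_update import (  # noqa: F401
            selective_state_update)
    except ImportError:
        return False
    return True


print("fast path available:", mamba_fast_path_available())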

@tdoublep (Member) commented

fwiw there is also an open PR to remove the mamba_ssm dependency:
#20047

@DarkLight1337 (Member) commented

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.

Can we force the HF implementation to use the slow path for now?

auto-merge was automatically disabled June 27, 2025 13:17

Head branch was pushed to by a user without write access

@s3woz s3woz force-pushed the granitemoehybrid branch from 516b4fc to bba31f1 Compare June 27, 2025 13:17
@s3woz (Contributor, Author) commented Jun 27, 2025

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.

Can we force the HF implementation to use the slow path for now?

I don't know of an easy way to force the slow path in HF Transformers.
BTW, the fast path is fixed on HF Transformers main (huggingface/transformers#39033), but the fix is pending a release. After consulting with @tdoublep, I've commented out the failing test and added a comment about the situation in the source code (a sketch follows below).
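
A rough sketch of how that looks in test_hybrid.py (the exact wording and placement in the PR may differ):

HYBRID_MODELS = [
    # ... other hybrid models ...
    # "ibm-granite/granite-4.0-tiny-preview" is temporarily disabled here:
    # the HF Transformers fast path produces wrong reference outputs when
    # mamba_ssm/causal_conv1d are installed. Fixed upstream in
    # huggingface/transformers#39033, but that fix is not yet released.
]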

@DarkLight1337 (Member) commented

Alright, that sounds good to me

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 27, 2025 13:32
@vllm-bot vllm-bot merged commit daec9de into vllm-project:main Jun 28, 2025
71 of 73 checks passed
CSWYF3634076 pushed a commit to CSWYF3634076/vllm that referenced this pull request Jul 2, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)