
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution #20137


Merged
2 commits merged into vllm-project:main on Jun 28, 2025

Conversation

@s3woz (Contributor) commented Jun 26, 2025

Purpose

  1. We noticed that GraniteMoeHybrid produces wrong outputs when executed in TensorParallel mode. This bugfix changes the attention implementation to a correct TensorParallel version, which eliminates the output discrepancies; the model now passes tests/models/language/generation/test_hybrid.py::test_distributed_correctness.

  2. Along with this PR, we also:

  • enable the vLLM tests from tests/models/language/generation/test_hybrid.py for GraniteMoeHybrid (they were previously disabled due to missing HF Transformers support, which was added in HF Transformers v4.52.1)

  • remove tests/models/language/generation/test_granitemoehybrid.py, as it duplicated more generic functionality already present in tests/models/language/generation/test_hybrid.py

@tdoublep @bohnstingl

Test Plan

  1. TensorParallel test validating the bugfix:
# NOTE: Temporarily modify test_hybrid.py Line 26 to HYBRID_MODELS=["ibm-granite/granite-4.0-tiny-preview"],
# as only HYBRID_MODELS[0] is used in the test, as explained in Line 12 (see the sketch after the test plan).
CUDA_VISIBLE_DEVICES=0,1 pytest ./tests/models/language/generation/test_hybrid.py::test_distributed_correctness -k granite
  2. Show that the model is included in the standard tests:
pytest ./tests/models/language/generation/test_hybrid.py --collect-only -k granite
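
For reference, the temporary edit mentioned in the note under item 1 would look roughly like this in test_hybrid.py (a local-only sketch; the actual list in the file also contains the other hybrid models):

# tests/models/language/generation/test_hybrid.py -- temporary local edit only.
# test_distributed_correctness exercises HYBRID_MODELS[0], so restrict the
# list to the Granite model under test.
HYBRID_MODELS = ["ibm-granite/granite-4.0-tiny-preview"]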

Test Result

  1. Before the bugfix, test 1 failed; it now succeeds.

  2. Previously no tests were collected for Granite; now two standard tests are collected: test_models and test_batching.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @s3woz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily resolves a critical bug affecting the GraniteMoeHybrid model's output correctness during TensorParallel execution by implementing a proper TensorParallel-aware attention mechanism. Beyond the core bugfix, I've also streamlined the test suite by enabling the model for existing vLLM tests and removing a duplicate test file, ensuring comprehensive and efficient validation.

Highlights

  • Bugfix for TensorParallel Correctness: I've addressed a critical bug in the GraniteMoeHybrid model that caused incorrect outputs when executed in TensorParallel mode. The attention implementation has been refactored to ensure correct behavior under distributed execution, passing the test_distributed_correctness.
  • Attention Layer Refactoring: The individual q_proj, k_proj, and v_proj linear layers within GraniteMoeHybridAttention have been consolidated into a single QKVParallelLinear layer (qkv_proj). Additionally, the o_proj layer now utilizes RowParallelLinear for improved efficiency and correctness in a TensorParallel setup (see the sketch after these highlights).
  • Weight Loading Adaptation: The model's weight loading mechanism has been updated to correctly map and load parameters for the new QKVParallelLinear and RowParallelLinear layers, ensuring compatibility with existing pre-trained weights.
  • Test Suite Enhancement and Cleanup: The GraniteMoeHybrid model has been re-enabled for vLLM tests in test_hybrid.py, leveraging recent HF Transformers support. Concurrently, the redundant test_granitemoehybrid.py file has been removed, as its functionality is now covered by the more generic test_hybrid.py.
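
For readers unfamiliar with the pattern, the attention refactoring described above boils down to the following wiring. This is a minimal sketch, not the exact diff: it assumes vLLM's QKVParallelLinear/RowParallelLinear layers and an initialized tensor-parallel group, and the dimensions are illustrative placeholders rather than the real GraniteMoeHybrid config values.

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)

# Illustrative placeholder dimensions (the real values come from the HF config).
hidden_size, head_dim = 1536, 128
total_num_heads, total_num_kv_heads = 12, 4

# Fused, column-parallel Q/K/V projection replacing the separate q/k/v_proj layers.
qkv_proj = QKVParallelLinear(hidden_size, head_dim, total_num_heads,
                             total_num_kv_heads, bias=False)

# Row-parallel output projection that all-reduces the per-rank partial results.
o_proj = RowParallelLinear(total_num_heads * head_dim, hidden_size, bias=False)

# Per-rank head counts used when splitting the fused qkv output.
tp_size = get_tensor_model_parallel_world_size()
num_heads = total_num_heads // tp_size
num_kv_heads = max(1, total_num_kv_heads // tp_size)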
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

The pull request introduces changes to the GraniteMoeHybrid model to correctly handle TensorParallel execution, including modifications to the attention implementation and updates to relevant tests. The changes involve incorporating QKVParallelLinear and RowParallelLinear, adjusting head dimensions for tensor parallelism, and modifying weight loading logic. The code appears well-structured and addresses the identified bug. The test plan seems adequate, covering both correctness and integration aspects.

Comment on lines +490 to +498
loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
if not loaded:
    _load(n, p)
Severity: medium

The loaded = False variable is initialized but might not be updated in all branches of the else block. If none of the if weight_name in n: conditions are met, loaded will remain False, and _load(n, p) will be called. However, if any of the conditions are met, loaded is set to True, and _load(n, p) is skipped. It's better to ensure that loaded is correctly set in all branches to avoid unexpected behavior. Consider adding a default else clause within the for loop to set loaded = False explicitly.

Suggested change (original):

loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
if not loaded:
    _load(n, p)

Suggested change (proposed):

loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
    else:
        loaded = False  # Add this line
if not loaded:
    _load(n, p)
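
For context, the conventional pattern used across vLLM models for stacked-parameter loading relies on a for/else with break rather than a boolean flag. A minimal sketch follows, assuming params_dict and weights come from the surrounding load_weights() implementation; the PR's _load/_load_shard helpers wrap equivalent logic.

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

stacked_params_mapping = [
    # (param_name, weight_name, shard_id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]

for name, loaded_weight in weights:
    for param_name, weight_name, shard_id in stacked_params_mapping:
        if weight_name not in name:
            continue
        # A stacked mapping matched: route this shard into the fused parameter.
        param = params_dict[name.replace(weight_name, param_name)]
        param.weight_loader(param, loaded_weight, shard_id)
        break
    else:
        # No stacked mapping matched: load the weight as-is.
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)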

@DarkLight1337 (Member) left a comment

Thanks, let's see if the test passes

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 27, 2025 03:56
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 27, 2025
@tdoublep (Member) commented Jun 27, 2025

The test is failing, but it looks like it is the HF output that is wrong:

FAILED models/language/generation/test_hybrid.py::test_models[5-64-ibm-granite/granite-4.0-tiny-preview] - AssertionError: Test0:
[2025-06-27T05:58:42Z] Matched tokens:	[203]
[2025-06-27T05:58:42Z] hf:	'\n the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the'	{322: -0.06545200943946838, 34: -3.5654520988464355, 30: -3.5654520988464355, 432: -6.0654520988464355, 32: -6.3154520988464355}
[2025-06-27T05:58:42Z] vllm-v0:	'\n### Key Features:\n\n1. **High-Throughput Inference:**\n   - **Multi-GPU Support:** Supports distributed training and inference across multiple GPUs, enabling faster processing of large-scale models.\n   - **Batching:** Efficiently handles multiple inputs simultaneously, maximizing GPU utilization.\n\n2'	{1482: Logprob(logprob=-2.2187652587890625, rank=1, decoded_token='###'), 433: Logprob(logprob=-2.3437652587890625, rank=2, decoded_token='##'), 1318: Logprob(logprob=-2.5937652587890625, rank=3, decoded_token='The'), 705: Logprob(logprob=-2.5937652587890625, rank=4, decoded_token='To'), 10921: Logprob(logprob=-2.9687652587890625, rank=5, decoded_token='Here')}

The tests pass locally for me (on both H100 and L4 GPUs) when I do not have the mamba_ssm and causal_conv1d packages installed. This leads to the following message when running the HF baseline:

The fast path is not available because on of `(selective_state_update, causal_conv1d_fn, causal_conv1d_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d

When I install those two packages in the same way as vLLM CI:

uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/[email protected]"
pip install 'git+https://github.com/Dao-AILab/[email protected]'

Then I get the same failure. It looks like there is an issue with the "fast path" in transformers provided by those two packages.

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.
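
For anyone reproducing this locally, here is a small hedged check of whether the kernels behind the HF "fast path" are importable, which is the condition that triggers the fallback message quoted above; the import paths are the commonly used ones for mamba_ssm and causal_conv1d and may vary across package versions.

def mamba_fast_path_available() -> bool:
    # Returns True only if both fast-path kernel packages import successfully.
    try:
        from causal_conv1d import causal_conv1d_fn, causal_conv1d_update  # noqa: F401
        from mamba_ssm.ops.triton.selective_state_update import (  # noqa: F401
            selective_state_update)
    except ImportError:
        return False
    return True


print("fast path available:", mamba_fast_path_available())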

@tdoublep (Member) commented

fwiw there is also an open PR to remove the mamba_ssm dependency:
#20047

@DarkLight1337 (Member) commented

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.

Can we force the HF implementation to use the slow path for now?

auto-merge was automatically disabled June 27, 2025 13:17

Head branch was pushed to by a user without write access

@s3woz s3woz force-pushed the granitemoehybrid branch from 516b4fc to bba31f1 Compare June 27, 2025 13:17
@s3woz (Contributor, Author) commented Jun 27, 2025

Proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.

Can we force the HF implementation to use the slow path for now?

I don't know of an easy way to force the slow path in HF Transformers.
BTW, the fast path is fixed on HF Transformers main (huggingface/transformers#39033), but the fix is pending a release. After consulting with @tdoublep, I've commented out the failing test and added a comment about the situation in the source code (a sketch follows below).
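
A rough sketch of how that looks in test_hybrid.py (the exact wording and placement in the PR may differ):

HYBRID_MODELS = [
    # ... other hybrid models ...
    # "ibm-granite/granite-4.0-tiny-preview" is temporarily disabled here:
    # the HF Transformers fast path produces wrong reference outputs when
    # mamba_ssm/causal_conv1d are installed. Fixed upstream in
    # huggingface/transformers#39033, but that fix is not yet released.
]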

@DarkLight1337 (Member) commented

Alright, that sounds good to me

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 27, 2025 13:32
@vllm-bot vllm-bot merged commit daec9de into vllm-project:main Jun 28, 2025
71 of 73 checks passed
CSWYF3634076 pushed a commit to CSWYF3634076/vllm that referenced this pull request Jul 2, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)