[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution #20137
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @s3woz, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily resolves a critical bug affecting the `GraniteMoeHybrid` model's output correctness during TensorParallel execution by implementing a proper TensorParallel-aware attention mechanism. Beyond the core bugfix, I've also streamlined the test suite by enabling the model for existing vLLM tests and removing a duplicate test file, ensuring comprehensive and efficient validation.
Highlights
- **Bugfix for TensorParallel Correctness**: I've addressed a critical bug in the `GraniteMoeHybrid` model that caused incorrect outputs when executed in TensorParallel mode. The attention implementation has been refactored to ensure correct behavior under distributed execution, passing the `test_distributed_correctness` test.
- **Attention Layer Refactoring**: The individual `q_proj`, `k_proj`, and `v_proj` linear layers within `GraniteMoeHybridAttention` have been consolidated into a single `QKVParallelLinear` layer (`qkv_proj`). Additionally, the `o_proj` layer now utilizes `RowParallelLinear` for improved efficiency and correctness in a TensorParallel setup (see the sketch after this list).
- **Weight Loading Adaptation**: The model's weight loading mechanism has been updated to correctly map and load parameters for the new `QKVParallelLinear` and `RowParallelLinear` layers, ensuring compatibility with existing pre-trained weights.
- **Test Suite Enhancement and Cleanup**: The `GraniteMoeHybrid` model has been re-enabled for vLLM tests in `test_hybrid.py`, leveraging recent HF Transformers support. Concurrently, the redundant `test_granitemoehybrid.py` file has been removed, as its functionality is now covered by the more generic `test_hybrid.py`.
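For illustration, here is a minimal sketch of how such a TensorParallel-aware attention block is typically wired up in vLLM. `QKVParallelLinear` and `RowParallelLinear` are the real vLLM layers named above, but the module name, fields, and constructor arguments shown here are assumptions for illustration, not the PR's exact code.

```python
# Minimal sketch (not the PR's code): wiring a TP-aware attention block.
import torch.nn as nn

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)


class SketchAttention(nn.Module):  # hypothetical module name

    def __init__(self, hidden_size: int, num_heads: int, head_dim: int):
        super().__init__()
        tp_size = get_tensor_model_parallel_world_size()
        # Under tensor parallelism each rank holds num_heads / tp_size heads,
        # so per-rank head counts must be divided by the TP world size.
        self.num_local_heads = num_heads // tp_size
        self.head_dim = head_dim

        # Column-parallel fused projection producing the Q, K and V shards,
        # replacing the separate q_proj / k_proj / v_proj linear layers.
        self.qkv_proj = QKVParallelLinear(hidden_size,
                                          head_dim,
                                          num_heads,
                                          bias=False)
        # Row-parallel output projection: each rank multiplies its local
        # shard and the partial results are all-reduced across ranks.
        self.o_proj = RowParallelLinear(num_heads * head_dim,
                                        hidden_size,
                                        bias=False)
```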
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request introduces changes to the `GraniteMoeHybrid` model to correctly handle TensorParallel execution, including modifications to the attention implementation and updates to relevant tests. The changes incorporate `QKVParallelLinear` and `RowParallelLinear`, adjust head dimensions for tensor parallelism, and modify the weight loading logic. The code appears well-structured and addresses the identified bug. The test plan seems adequate, covering both correctness and integration aspects.
```python
loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
if not loaded:
    _load(n, p)
```
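For context, `stacked_params_mapping` in vLLM models is conventionally a list of `(param_name, weight_name, shard_id)` tuples that routes per-projection checkpoint weights into the fused parallel layers. The entries below are an assumption based on the layer names in this PR, not a quote of its code:

```python
# Hypothetical mapping for the qkv_proj fusion described above: checkpoint
# weights named *.q_proj / *.k_proj / *.v_proj are loaded into the matching
# shard of the fused qkv_proj parameter.
stacked_params_mapping = [
    # (param_name, weight_name, shard_id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]
```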
The `loaded = False` variable is initialized but might not be updated in all branches of the `else` block. If none of the `if weight_name in n:` conditions are met, `loaded` will remain `False`, and `_load(n, p)` will be called. However, if any of the conditions are met, `loaded` is set to `True`, and `_load(n, p)` is skipped. It's better to ensure that `loaded` is correctly set in all branches to avoid unexpected behavior. Consider adding a default `else` clause within the `for` loop to set `loaded = False` explicitly.
Suggested change:

```python
loaded = False
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        loaded = True
    else:
        loaded = False  # Add this line
if not loaded:
    _load(n, p)
```
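As an aside, the boolean flag can be avoided entirely with Python's `for`/`else` construct, whose `else` branch runs only when the loop finishes without `break`. This is a minimal sketch, not code from the PR, and it assumes stopping at the first matching mapping entry is acceptable:

```python
# Sketch: for/else removes the need for a `loaded` flag; the else branch
# executes only if the loop completed without hitting `break`.
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name in n:
        _load_shard(n.replace(weight_name, param_name),
                    p,
                    shard_id=shard_id)
        break
else:
    _load(n, p)
```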
Thanks, let's see if the test passes
The test is failing, but it looks like it is the HF output that is wrong:
The tests pass locally for me (on both H100 and L4 GPUs) when I do not have the
When I install those two packages in the same way as vLLM CI:
Then I get the same failure. It looks like there is an issue with the "fast path" in transformers provided by those two packages. The proposal for now would be to disable the test for this model while we investigate, but merge these fixes anyway since they seem to work.
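One lightweight way to disable the test for this model while the fast-path issue is investigated is a pytest skip marker on the parametrized model id. This is a hypothetical sketch; the model id and list name are assumptions, not the PR's actual change:

```python
# Hypothetical: skip only this model in test_hybrid.py, keeping the rest
# of the hybrid-model test matrix running.
import pytest

HYBRID_MODELS = [  # assumed name of the parametrization list
    pytest.param(
        "ibm-granite/granite-4.0-tiny-preview",  # assumed model id
        marks=pytest.mark.skip(
            reason="HF fast path yields wrong reference outputs"),
    ),
]
```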
fwiw there is also an open PR to remove the
Signed-off-by: Stanislaw Wozniak <[email protected]>
Can we force the HF implementation to use the slow path for now?
…s issue. Signed-off-by: Stanislaw Wozniak <[email protected]>
Head branch was pushed to by a user without write access
I don't know any easy way to enforce the slow path for HF transformers.
Alright, that sounds good to me
…ution (vllm-project#20137) Signed-off-by: Stanislaw Wozniak <[email protected]>
…ution (vllm-project#20137) Signed-off-by: Stanislaw Wozniak <[email protected]> Signed-off-by: avigny <[email protected]>
Purpose
We noticed that GraniteMoeHybrid produces wrong outputs when executed in TensorParallel mode. This bugfix changes the attention implementation to a correct TensorParallel version, which avoids discrepancies in the outputs and now passes the `tests/models/language/generation/test_hybrid.py::test_distributed_correctness` test.

Along with this PR, we also:

- enable the vLLM tests from `tests/models/language/generation/test_hybrid.py` to run for GraniteMoeHybrid (they were previously disabled due to missing HF Transformers support, which was added in HF Transformers v4.52.1)
- remove `tests/models/language/generation/test_granitemoehybrid.py`, as it duplicated more generic functionality already present in `tests/models/language/generation/test_hybrid.py`
@tdoublep @bohnstingl
Test Plan
Test Result
Before the bugfix, test 1 was failing; now it succeeds.
Previously no tests were collected for Granite; now two standard tests are collected: `test_models` and `test_batching`.
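For reference, the newly collected tests can be invoked programmatically; the node ids below come from this PR, and running pytest via `pytest.main` is just one convenient option:

```python
# Run the two standard tests now collected for GraniteMoeHybrid.
import pytest

pytest.main([
    "tests/models/language/generation/test_hybrid.py::test_models",
    "tests/models/language/generation/test_hybrid.py::test_batching",
    "-v",
])
```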