[Bugfix][ROCm] Fix incorrect casting in GPTQ GEMM kernel #17583
base: main
Conversation
Signed-off-by: Yan Cangang <[email protected]>
Looks good and thanks for the fix. The conversions were workarounds for older ROCm where implicit casting wasn’t as comprehensive, but they’re no longer needed with the latest versions.
@nlzy could you please check the tests? Could be enough to merge from main.
- Fix double type conversion bug in q_gemm.cu affecting all GPTQ models with tensor parallelism on ROCm
- Move half2 res2 declaration inside loop with proper zero initialization
- Remove problematic __half_as_ushort/__ushort_as_half conversions
- Fix false Triton flash attention warning for models with sliding window when VLLM_USE_TRITON_FLASH_ATTN=0
- Changes match upstream PR vllm-project#17583

This fixes silent data corruption that was causing GPTQ models to produce gibberish output on ROCm with tensor parallelism.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Thanks for the reviews. I have checked the failed tests in CI, and they should be unrelated to this PR.
Can you merge from main to fix the CI failures?
As mentioned in #7374, when using a GPTQ model with `desc_act=True` and enabling tensor parallelism, the output becomes garbled. Additionally, this issue is specific to ROCm and does not occur on NVIDIA GPUs. To summarize, this bug can only be triggered if all three of the following conditions are met:

- the model is GPTQ-quantized with `desc_act=True`
- tensor parallelism is enabled
- the platform is ROCm
The following code reveals that if the model uses `desc_act=True` and the user enables tensor parallelism, a non-exllama kernel will be used:

vllm/vllm/model_executor/layers/quantization/gptq.py
Lines 165 to 169 in ba41cc9
The non-exllama kernel contains the following code:
vllm/csrc/quantization/gptq/q_gemm.cu
Lines 1242 to 1247 in ba41cc9
and
vllm/csrc/quantization/gptq/q_gemm.cu
Lines 1260 to 1265 in ba41cc9
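For context, the two conditional blocks referenced above have roughly the following shape (a paraphrased sketch based on the description in this PR, not a verbatim copy of `q_gemm.cu`; the wrapper function name is made up for illustration):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

// res2 is a half2 accumulator holding partial dot-product sums; res is the
// scalar half accumulator the kernel ultimately writes out.
__device__ half fold_accumulator(half res, half2 res2) {
#ifndef USE_ROCM
  // CUDA branch: plain half additions.
  return __hadd(res, __hadd(res2.x, res2.y));
#else
  // ROCm branch (before this PR): res2.x and res2.y are already half values,
  // so passing them to __ushort_as_half() first forces an implicit
  // half -> ushort conversion, which is where the corruption happens.
  return __hadd(res, __hadd(__ushort_as_half(res2.x),
                            __ushort_as_half(res2.y)));
#endif
}
```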
The above code behaves differently on CUDA and ROCm platforms. On CUDA, the logic is straightforward: variables are simply initialized to zero, and basic addition operations are performed.
However, the ROCm code looks unusual. The function `__ushort_as_half()` reinterprets a `ushort` as a `half` within the type system; it does not execute an actual conversion instruction. But because this function requires a `ushort` argument, C++ implicit conversion rules convert the `half` argument to `ushort`, and that conversion does execute an actual conversion instruction, altering its value. After removing this illogical behavior, the issue was resolved.
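To see why this corrupts the result, consider the following self-contained sketch (assuming, as described above, that `__half` is implicitly convertible to `unsigned short`; the function name is made up for illustration):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

__device__ half illustrate_corruption() {
  half h = __float2half(1.0f);  // the value 1.0, bit pattern 0x3C00

  // What the old ROCm branch effectively did: h is first converted BY VALUE
  // to unsigned short (1.0 -> 1), and that integer's bit pattern is then
  // reinterpreted as a half. 0x0001 is the smallest subnormal half
  // (~5.96e-8), so the value is silently destroyed.
  half corrupted = __ushort_as_half(h);

  // A true bit-reinterpretation round trip, by contrast, is lossless.
  half preserved = __ushort_as_half(__half_as_ushort(h));  // still 1.0

  return __hadd(corrupted, preserved);  // ~1.0 instead of the expected 2.0
}
```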
Additionally, the ROCm compiler does not accept code like `res2 = {};`: it complains about multiple viable candidate functions, leading to ambiguity in the function call. Since the variable `res2` is only used within this code block, this PR moves the declaration of `res2` inside the block and uses the value-initialization syntax `half2 res2{};` to ensure the variable is initialized to zero.

FIX #7374
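As an illustration of the fixed shape described above (a sketch, not the literal diff; the function name and the `partials` parameter are made up):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

// res2 is declared where it is used and value-initialized to zero, and the
// final reduction uses plain half additions on both CUDA and ROCm, so no
// bit-cast workarounds are needed.
__device__ void accumulate_block(half& res, const half2* partials, int n) {
  half2 res2{};  // value-initialization: both halves start at 0
  for (int i = 0; i < n; ++i) {
    res2 = __hadd2(res2, partials[i]);
  }
  res = __hadd(res, __hadd(res2.x, res2.y));
}
```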