[Bugfix][ROCm] Fix incorrect casting in GPTQ GEMM kernel #17583
base: main
Conversation
Signed-off-by: Yan Cangang <[email protected]>
Looks good and thanks for the fix. The conversions were workarounds for older ROCm where implicit casting wasn’t as comprehensive, but they’re no longer needed with the latest versions.
@nlzy could you please check the tests? Could be enough to merge from main.
- Fix double type conversion bug in q_gemm.cu affecting all GPTQ models with tensor parallelism on ROCm
- Move half2 res2 declaration inside loop with proper zero initialization
- Remove problematic __half_as_ushort/__ushort_as_half conversions
- Fix false Triton flash attention warning for models with sliding window when VLLM_USE_TRITON_FLASH_ATTN=0
- Changes match upstream PR vllm-project#17583

This fixes silent data corruption that was causing GPTQ models to produce gibberish output on ROCm with tensor parallelism.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Thanks for the reviews. I have checked the failed tests in CI, and they should be unrelated to this PR.
Can you merge from main to fix the CI failures?
As mentioned in #7374, when using a GPTQ model with `desc_act=True` and enabling tensor parallelism, the output becomes garbled. Additionally, this issue is specific to ROCm and does not occur on NVIDIA GPUs. To summarize, this bug can only be triggered if all three of the following conditions are met:

- the model is GPTQ-quantized with `desc_act=True`
- tensor parallelism is enabled
- the platform is ROCm
The following code reveals that if the model uses `desc_act=True` and the user enables tensor parallelism, a non-exllama kernel will be used:

vllm/vllm/model_executor/layers/quantization/gptq.py
Lines 165 to 169 in ba41cc9
The non-exllama kernel contains the following code:
vllm/csrc/quantization/gptq/q_gemm.cu
Lines 1242 to 1247 in ba41cc9
and
vllm/csrc/quantization/gptq/q_gemm.cu
Lines 1260 to 1265 in ba41cc9
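For context, the two conditional blocks referenced above have roughly the following shape (a paraphrased sketch based on the description in this PR, not a verbatim copy of `q_gemm.cu`; the wrapper function name is made up for illustration):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

// res2 is a half2 accumulator holding partial dot-product sums; res is the
// scalar half accumulator the kernel ultimately writes out.
__device__ half fold_accumulator(half res, half2 res2) {
#ifndef USE_ROCM
  // CUDA branch: plain half additions.
  return __hadd(res, __hadd(res2.x, res2.y));
#else
  // ROCm branch (before this PR): res2.x and res2.y are already half values,
  // so passing them to __ushort_as_half() first forces an implicit
  // half -> ushort conversion, which is where the corruption happens.
  return __hadd(res, __hadd(__ushort_as_half(res2.x),
                            __ushort_as_half(res2.y)));
#endif
}
```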
The above code behaves differently on CUDA and ROCm platforms. On CUDA, the logic is straightforward: variables are simply initialized to zero, and basic addition operations are performed.
However, the ROCm code looks unusual. The function `__ushort_as_half()` reinterprets a `ushort` as a `half` within the type system; it does not execute an actual conversion instruction. But because this function requires a `ushort` argument, C++ implicit conversion rules convert the `half` argument to `ushort`, and that conversion does execute an actual conversion instruction, altering its value. After removing this illogical behavior, the issue was resolved.
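To see why this corrupts the result, consider the following self-contained sketch (assuming, as described above, that `__half` is implicitly convertible to `unsigned short`; the function name is made up for illustration):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

__device__ half illustrate_corruption() {
  half h = __float2half(1.0f);  // the value 1.0, bit pattern 0x3C00

  // What the old ROCm branch effectively did: h is first converted BY VALUE
  // to unsigned short (1.0 -> 1), and that integer's bit pattern is then
  // reinterpreted as a half. 0x0001 is the smallest subnormal half
  // (~5.96e-8), so the value is silently destroyed.
  half corrupted = __ushort_as_half(h);

  // A true bit-reinterpretation round trip, by contrast, is lossless.
  half preserved = __ushort_as_half(__half_as_ushort(h));  // still 1.0

  return __hadd(corrupted, preserved);  // ~1.0 instead of the expected 2.0
}
```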
Additionally, the ROCm compiler does not accept code like `res2 = {};`: it complains about multiple viable candidate functions, leading to ambiguity in the function call. Since the variable `res2` is only used within this code block, this PR moves the declaration of `res2` inside the block and uses the value-initialization syntax `half2 res2{};` to ensure the variable is initialized to zero.

FIX #7374
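As an illustration of the fixed shape described above (a sketch, not the literal diff; the function name and the `partials` parameter are made up):

```cpp
#ifdef USE_ROCM
  #include <hip/hip_fp16.h>
#else
  #include <cuda_fp16.h>
#endif

// res2 is declared where it is used and value-initialized to zero, and the
// final reduction uses plain half additions on both CUDA and ROCm, so no
// bit-cast workarounds are needed.
__device__ void accumulate_block(half& res, const half2* partials, int n) {
  half2 res2{};  // value-initialization: both halves start at 0
  for (int i = 0; i < n; ++i) {
    res2 = __hadd2(res2, partials[i]);
  }
  res = __hadd(res, __hadd(res2.x, res2.y));
}
```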