Fix kv_cache_dtype handling for out-of-tree HPU plugin #21302

Merged

Conversation

kzawora-intel
Contributor

PR #21131 removed the HPU checks for the --kv-cache-dtype flag, so the out-of-tree HPU plugin (https://github.com/vllm-project/vllm-gaudi) is now unable to use KV cache quantization with INC, failing with NotImplementedError: VLLM_USE_V1=1 is not supported with --kv-cache-dtype. This PR adds a check for the out-of-tree HPU plugin and allows it to use fp8_inc KV cache quantization.
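
For context, a minimal sketch of the kind of gate this PR relaxes, assuming a platform hook along the lines of the is_kv_cache_dtype_supported method discussed in the review below (the hook name and call site are illustrative, not the exact vLLM code):

```python
# Illustrative sketch only: hook name and call site are assumptions,
# modeled on the snippets quoted in the review below.
from vllm.platforms import current_platform

def validate_kv_cache_dtype(kv_cache_dtype: str) -> None:
    if kv_cache_dtype == "auto":
        return  # default dtype, nothing to validate
    # Defer to the active platform (including out-of-tree plugins such as
    # vllm-gaudi) instead of hard-coding a per-device allowlist here.
    if not current_platform.is_kv_cache_dtype_supported(kv_cache_dtype):
        raise NotImplementedError(
            "VLLM_USE_V1=1 is not supported with --kv-cache-dtype.")
```

With such a hook in place, the HPU plugin can report fp8_inc as supported without upstream vLLM needing to know about it.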

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request fixes an issue with kv_cache_dtype handling for the out-of-tree HPU plugin by adding a specific check to allow fp8_inc quantization. The change correctly enables the feature for HPU. My review includes a suggestion to refactor the added logic for better maintainability and to prevent potential future bugs by integrating the new platform check into the existing conditional block.

@kzawora-intel force-pushed the private/kzawora/kv_cache_dtype_fix branch from 0044f12 to d09dd8a on July 21, 2025 11:07
@kzawora-intel
Contributor Author

@xuechendi please take a look


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify bot added the rocm (Related to AMD ROCm) and tpu (Related to Google TPUs) labels Jul 21, 2025
Comment on lines 591 to 592

```python
if kv_cache_dtype == "auto":
    return True
```
Member


No need to check auto since this method is only called otherwise

@xuechendi
Contributor

@WoosukKwon @mgoin , please help to review.
HPU V1 now supports kv_cache_dtype, but the existing code does not assume that an out-of-tree device supports kv_cache_dtype.
We want to update that. Thanks

"""
Returns if the kv_cache_dtype is supported by the current platform.
"""
return kv_cache_dtype == "auto"
Member


The default should be something like

```python
fp8_attention = self.kv_cache_dtype.startswith("fp8")
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
if fp8_attention and will_use_fa:
    from vllm.attention.utils.fa_utils import (
        flash_attn_supports_fp8)
    return flash_attn_supports_fp8()
return False
```

Member


Why is this so? This logic is specific to CUDA anyway

Member


I am just following the previous logic strictly.

Member


Feel free to update if the old logic is wrong

@DarkLight1337 added this to the v0.10.0 milestone Jul 21, 2025
@DarkLight1337 enabled auto-merge (squash) July 21, 2025 14:35
@DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 21, 2025
```python
import vllm.envs as envs
fp8_attention = kv_cache_dtype.startswith("fp8")
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
if fp8_attention and will_use_fa:
    from vllm.attention.utils.fa_utils import flash_attn_supports_fp8
    return flash_attn_supports_fp8()
return False
```
Member


I think the base class should always return false so it is up to the specific impls to define

Suggested change

```diff
-import vllm.envs as envs
-fp8_attention = kv_cache_dtype.startswith("fp8")
-will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
-if fp8_attention and will_use_fa:
-    from vllm.attention.utils.fa_utils import flash_attn_supports_fp8
-    return flash_attn_supports_fp8()
-return False
+return False
```
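
To make the suggested split concrete, here is a rough sketch of how the base class and a backend override could relate, assuming the hook is a Platform classmethod named is_kv_cache_dtype_supported as in the snippets above (exact class and method names in vLLM may differ):

```python
class Platform:
    @classmethod
    def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool:
        """Returns if the kv_cache_dtype is supported by the current platform."""
        # Conservative default: each platform opts in by overriding this.
        return False


class CudaPlatform(Platform):
    @classmethod
    def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool:
        # CUDA-specific logic from the reviewed diff: fp8 KV cache is only
        # supported when the FlashAttention V1 backend can handle fp8.
        import vllm.envs as envs
        fp8_attention = kv_cache_dtype.startswith("fp8")
        will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
        if fp8_attention and will_use_fa:
            from vllm.attention.utils.fa_utils import flash_attn_supports_fp8
            return flash_attn_supports_fp8()
        return False
```

The out-of-tree HPU plugin can then provide its own override that returns True for fp8_inc.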

Contributor


@mgoin I have updated the code based on your suggestion.

auto-merge was automatically disabled July 21, 2025 20:22

Head branch was pushed to by a user without write access

@xuechendi force-pushed the private/kzawora/kv_cache_dtype_fix branch from 19f684e to 2f60ad2 on July 21, 2025 20:22
@xuechendi force-pushed the private/kzawora/kv_cache_dtype_fix branch from 2f60ad2 to ca115a2 on July 21, 2025 22:24
@vllm-bot merged commit c17231e into vllm-project:main Jul 22, 2025
67 of 69 checks passed
kzawora-intel added a commit to vllm-project/vllm-gaudi that referenced this pull request Jul 22, 2025
vllm-project/vllm#21302 got merged, we can
re-enable kv_cache_dtype now.

---------

Signed-off-by: Konrad Zawora <[email protected]>
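
As a usage illustration only (the model name below is hypothetical; "fp8_inc" is the KV cache dtype this PR allows the HPU plugin to accept), enabling INC KV cache quantization from Python might look roughly like this once the vllm-gaudi plugin is installed:

```python
# Hypothetical sketch: model name is illustrative; "fp8_inc" is the KV cache
# dtype this PR lets the out-of-tree HPU plugin report as supported.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8_inc")
```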
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jul 23, 2025
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
wenscarl pushed a commit to wenscarl/vllm that referenced this pull request Aug 4, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), tpu (Related to Google TPUs)
5 participants