Fix kv_cache_dtype handling for out-of-tree HPU plugin #21302
Conversation
Code Review

This pull request fixes an issue with kv_cache_dtype handling for the out-of-tree HPU plugin by adding a specific check to allow fp8_inc quantization. The change correctly enables the feature for HPU. My review includes a suggestion to refactor the added logic for better maintainability and to prevent potential future bugs by integrating the new platform check into the existing conditional block.
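For illustration, a minimal sketch of the refactor direction this review describes: folding the HPU fp8_inc allowance into one conditional rather than adding a separate early return. The `is_out_of_tree()` helper and the exact condition shape are assumptions for the sketch, not the PR's literal diff.

```python
# Sketch only -- not the PR's literal diff. Folds the out-of-tree HPU
# fp8_inc allowance into the existing conditional chain; the platform
# argument and its is_out_of_tree() helper are assumed for illustration.
def is_kv_cache_dtype_supported(platform, kv_cache_dtype: str) -> bool:
    if kv_cache_dtype == "auto" or (platform.is_out_of_tree()
                                    and kv_cache_dtype == "fp8_inc"):
        return True
    return False
```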
Force-pushed from 0044f12 to d09dd8a
@xuechendi please take a look
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
vllm/platforms/cuda.py (Outdated)

```python
if kv_cache_dtype == "auto":
    return True
```
No need to check `auto`, since this method is only called otherwise.
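To illustrate the point, a hedged sketch of the call-site shape this comment implies (not vLLM's literal code; the hook's exact signature is assumed):

```python
from vllm.platforms import current_platform

kv_cache_dtype = "fp8"  # example value

# Illustrative only: the caller short-circuits "auto" before consulting
# the platform hook, so the hook itself never needs to handle it.
supported = (kv_cache_dtype == "auto" or
             current_platform.is_kv_cache_dtype_supported(kv_cache_dtype))
```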
@WoosukKwon @mgoin, please help to review.
vllm/platforms/interface.py (Outdated)

```python
"""
Returns if the kv_cache_dtype is supported by the current platform.
"""
return kv_cache_dtype == "auto"
```
The default should be something like:

```python
fp8_attention = self.kv_cache_dtype.startswith("fp8")
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
if fp8_attention and will_use_fa:
    from vllm.attention.utils.fa_utils import (
        flash_attn_supports_fp8)
    return flash_attn_supports_fp8()
return False
```
Why is this so? This logic is specific to CUDA anyway
I am just following the previous logic strictly.
Feel free to update if the old logic is wrong
vllm/platforms/interface.py
Outdated
import vllm.envs as envs | ||
fp8_attention = kv_cache_dtype.startswith("fp8") | ||
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" | ||
if fp8_attention and will_use_fa: | ||
from vllm.attention.utils.fa_utils import flash_attn_supports_fp8 | ||
return flash_attn_supports_fp8() | ||
return False |
I think the base class should always return `False`, so it is up to the specific impls to define it.
Suggested change:

```diff
-import vllm.envs as envs
-fp8_attention = kv_cache_dtype.startswith("fp8")
-will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
-if fp8_attention and will_use_fa:
-    from vllm.attention.utils.fa_utils import flash_attn_supports_fp8
-    return flash_attn_supports_fp8()
-return False
+return False
```
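For context, with the base class defaulting to `False`, an out-of-tree platform opts in by overriding the hook. A hedged sketch of what the plugin side might look like (class name, method signature, and the instance-vs-classmethod choice are assumed from the snippets above; this is not the actual vllm-gaudi source):

```python
# Hypothetical plugin-side override (not the actual vllm-gaudi code):
# with the base hook returning False, the out-of-tree HPU platform
# opts in to the one dtype it can serve.
from vllm.platforms.interface import Platform


class HpuPlatform(Platform):

    @classmethod
    def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool:
        # fp8_inc = fp8 KV cache quantization via Intel Neural Compressor.
        return kv_cache_dtype == "fp8_inc"
```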
@mgoin I have updated the code based on your suggestion.
Head branch was pushed to by a user without write access
Force-pushed from 19f684e to 2f60ad2
Force-pushed from 2f60ad2 to ca115a2
vllm-project/vllm#21302 got merged, we can re-enable kv_cache_dtype now. Signed-off-by: Konrad Zawora <[email protected]>
PR #21131 removed the HPU checks for the --kv-cache-dtype flag, so the HPU out-of-tree plugin (https://github.com/vllm-project/vllm-gaudi) is now unable to use KV cache quantization with INC, failing with NotImplementedError: VLLM_USE_V1=1 is not supported with --kv-cache-dtype. This PR adds a check for the out-of-tree HPU plugin and allows it to use fp8_inc KV cache quantization.
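To make the failure mode concrete, here is a hedged sketch of the validation path described above (not vLLM's literal code; the wrapper function and the hook's exact signature are assumptions following the snippets earlier in the thread). Before this PR, the platform hook rejected everything except auto, so requesting fp8_inc on the HPU plugin raised the quoted error.

```python
from vllm.platforms import current_platform

# Hedged sketch of the validation described above -- not vLLM's literal
# code. With the base hook returning False and the HPU plugin overriding
# it to accept "fp8_inc", this guard now passes; before this PR it raised.
def validate_kv_cache_dtype(kv_cache_dtype: str) -> None:
    if kv_cache_dtype == "auto":
        return
    if not current_platform.is_kv_cache_dtype_supported(kv_cache_dtype):
        raise NotImplementedError(
            "VLLM_USE_V1=1 is not supported with --kv-cache-dtype.")


validate_kv_cache_dtype("fp8_inc")  # OK on the HPU plugin after this PR
```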