Fix kv_cache_dtype handling for out-of-tree HPU plugin #21302
Conversation
Code Review

This pull request fixes an issue with kv_cache_dtype handling for the out-of-tree HPU plugin by adding a specific check to allow fp8_inc quantization. The change correctly enables the feature for HPU. My review includes a suggestion to refactor the added logic for better maintainability and to prevent potential future bugs by integrating the new platform check into the existing conditional block.
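For illustration, a minimal sketch of the refactor direction this review describes: folding the HPU fp8_inc allowance into one conditional rather than adding a separate early return. The `is_out_of_tree()` helper and the exact condition shape are assumptions for the sketch, not the PR's literal diff.

```python
# Sketch only -- not the PR's literal diff. Folds the out-of-tree HPU
# fp8_inc allowance into the existing conditional chain; the platform
# argument and its is_out_of_tree() helper are assumed for illustration.
def is_kv_cache_dtype_supported(platform, kv_cache_dtype: str) -> bool:
    if kv_cache_dtype == "auto" or (platform.is_out_of_tree()
                                    and kv_cache_dtype == "fp8_inc"):
        return True
    return False
```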
Force-pushed from 0044f12 to d09dd8a
@xuechendi please take a look
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
vllm/platforms/cuda.py (Outdated)

```python
if kv_cache_dtype == "auto":
    return True
```
No need to check `auto`, since this method is only called otherwise.
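To illustrate the point, a hedged sketch of the call-site shape this comment implies (not vLLM's literal code; the hook's exact signature is assumed):

```python
from vllm.platforms import current_platform

kv_cache_dtype = "fp8"  # example value

# Illustrative only: the caller short-circuits "auto" before consulting
# the platform hook, so the hook itself never needs to handle it.
supported = (kv_cache_dtype == "auto" or
             current_platform.is_kv_cache_dtype_supported(kv_cache_dtype))
```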
@WoosukKwon @mgoin, please help to review.
vllm/platforms/interface.py (Outdated)

```python
"""
Returns if the kv_cache_dtype is supported by the current platform.
"""
return kv_cache_dtype == "auto"
```
The default should be something like:

```python
fp8_attention = self.kv_cache_dtype.startswith("fp8")
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
if fp8_attention and will_use_fa:
    from vllm.attention.utils.fa_utils import (
        flash_attn_supports_fp8)
    return flash_attn_supports_fp8()
return False
```
Why is this so? This logic is specific to CUDA anyway
I am just following the previous logic strictly.
Feel free to update if the old logic is wrong
vllm/platforms/interface.py
Outdated
import vllm.envs as envs | ||
fp8_attention = kv_cache_dtype.startswith("fp8") | ||
will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" | ||
if fp8_attention and will_use_fa: | ||
from vllm.attention.utils.fa_utils import flash_attn_supports_fp8 | ||
return flash_attn_supports_fp8() | ||
return False |
I think the base class should always return `False`, so it is up to the specific impls to define it.
Suggested change:

```diff
-import vllm.envs as envs
-fp8_attention = kv_cache_dtype.startswith("fp8")
-will_use_fa = envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
-if fp8_attention and will_use_fa:
-    from vllm.attention.utils.fa_utils import flash_attn_supports_fp8
-    return flash_attn_supports_fp8()
-return False
+return False
```
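For context, with the base class defaulting to `False`, an out-of-tree platform opts in by overriding the hook. A hedged sketch of what the plugin side might look like (class name, method signature, and the instance-vs-classmethod choice are assumed from the snippets above; this is not the actual vllm-gaudi source):

```python
# Hypothetical plugin-side override (not the actual vllm-gaudi code):
# with the base hook returning False, the out-of-tree HPU platform
# opts in to the one dtype it can serve.
from vllm.platforms.interface import Platform


class HpuPlatform(Platform):

    @classmethod
    def is_kv_cache_dtype_supported(cls, kv_cache_dtype: str) -> bool:
        # fp8_inc = fp8 KV cache quantization via Intel Neural Compressor.
        return kv_cache_dtype == "fp8_inc"
```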
@mgoin I have updated the code based on your suggestion.
Head branch was pushed to by a user without write access
Force-pushed from 19f684e to 2f60ad2
Force-pushed from 2f60ad2 to ca115a2
vllm-project/vllm#21302 got merged, we can re-enable kv_cache_dtype now. Signed-off-by: Konrad Zawora <[email protected]>
PR #21131 removed the HPU checks for the --kv-cache-dtype flag, so the HPU out-of-tree plugin (https://github.com/vllm-project/vllm-gaudi) is now unable to use KV cache quantization with INC, failing with NotImplementedError: VLLM_USE_V1=1 is not supported with --kv-cache-dtype. This PR adds a check for the out-of-tree HPU plugin and allows it to use fp8_inc KV cache quantization.
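To make the failure mode concrete, here is a hedged sketch of the validation path described above (not vLLM's literal code; the wrapper function and the hook's exact signature are assumptions following the snippets earlier in the thread). Before this PR, the platform hook rejected everything except auto, so requesting fp8_inc on the HPU plugin raised the quoted error.

```python
from vllm.platforms import current_platform

# Hedged sketch of the validation described above -- not vLLM's literal
# code. With the base hook returning False and the HPU plugin overriding
# it to accept "fp8_inc", this guard now passes; before this PR it raised.
def validate_kv_cache_dtype(kv_cache_dtype: str) -> None:
    if kv_cache_dtype == "auto":
        return
    if not current_platform.is_kv_cache_dtype_supported(kv_cache_dtype):
        raise NotImplementedError(
            "VLLM_USE_V1=1 is not supported with --kv-cache-dtype.")


validate_kv_cache_dtype("fp8_inc")  # OK on the HPU plugin after this PR
```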