
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints #6761


Merged
merged 3 commits into main from fix-unloaded-fp8-kv-scales
Jul 25, 2024

Conversation

@mgoin (Member) commented Jul 24, 2024

FIX #6738

Enabling kv_cache_dtype=fp8 quantization without scales in an FP8 model checkpoint was broken since the default k/v_scale wasn't saved as a float.

On main you would see this error if enabling fp8 kv cache with an FP8 model:

> vllm serve neuralmagic/Qwen2-0.5B-Instruct-FP8 --enforce-eager --kv-cache-dtype=fp8

[rank0]:   File "/home/mgoin/code/vllm-ct/vllm/model_executor/model_loader/loader.py", line 294, in load_model
[rank0]:     quant_method.process_weights_after_loading(module)
[rank0]:   File "/home/mgoin/code/vllm-ct/vllm/model_executor/layers/quantization/kv_cache.py", line 65, in process_weights_after_loading
[rank0]:     raise ValueError("Only support per-tensor scaling factor "
[rank0]: ValueError: Only support per-tensor scaling factor for fp8 KV cache
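
The idea behind the fix is to make the fallback scales land as float tensors rather than plain Python numbers, so the per-tensor validation in `process_weights_after_loading` accepts them. Below is a minimal sketch of that idea, not the actual PR diff; the helper name `default_kv_scales` and its exact placement are assumptions for illustration only.

```python
import torch

def default_kv_scales(layer: torch.nn.Module) -> None:
    """Ensure per-tensor default scales exist for the fp8 KV cache.

    If the checkpoint provides no kv scales, fall back to 1.0 stored as a
    0-dim float32 tensor, so the later per-tensor check does not fail.
    (Sketch only; names are hypothetical.)
    """
    for name in ("k_scale", "v_scale"):
        scale = getattr(layer, name, None)
        if scale is None:
            # No scale in the checkpoint: default to 1.0 as a float tensor.
            scale = torch.tensor(1.0, dtype=torch.float32)
        elif not isinstance(scale, torch.Tensor):
            # A plain Python number would not pass the per-tensor scaling
            # factor validation; cast it to a float tensor instead.
            scale = torch.tensor(float(scale), dtype=torch.float32)
        setattr(layer, name, torch.nn.Parameter(scale, requires_grad=False))
```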


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI, which is required before merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin mgoin changed the title [Bugfix] Fix fp8 kv cache without scales [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints Jul 24, 2024
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 24, 2024
@mgoin mgoin enabled auto-merge (squash) July 25, 2024 14:51
@simon-mo simon-mo disabled auto-merge July 25, 2024 16:46
@simon-mo simon-mo merged commit 65b1f12 into main Jul 25, 2024
70 of 73 checks passed
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
@simon-mo simon-mo deleted the fix-unloaded-fp8-kv-scales branch October 28, 2024 16:50
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together failed on the latest 0.5.3
4 participants