Has this been supported or requested before?
- I have checked the GitHub README.
- I have checked the Qwen documentation.
- I have checked the documentation of the related framework.
- I have searched the issues and there is not a similar one.
What is this feature about?
a new quantized model
Proposal
Introduction
I would like Qwen to also publish FP8 checkpoints that use group/per-tensor weight scales (rather than only block-wise scales), so they can be deployed with vLLM on sm120 GPUs.
Rationale
Implementation of this feature would help the following use case:
In vllm-project/vllm#17280 it was stated that FP8 is supported on sm120, and https://huggingface.co/RedHatAI/Qwen2.5-3B-FP8-dynamic was used as the test model. A few days later, however, someone reported that Qwen3-14B-FP8 (https://huggingface.co/Qwen/Qwen3-14B-FP8) still cannot run on an RTX PRO 6000. After some investigation I found that on sm120 vLLM only supports group-scaled FP8, not block-scaled FP8, and unfortunately all of the FP8 models published by Qwen are block-scaled. I therefore hope Qwen will also provide group-scaled FP8 models for deployment on sm120. Thanks a lot!
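To make the difference concrete, here is a minimal plain-PyTorch sketch (not code from vLLM or the Qwen repos; the function names are made up for illustration). Per-tensor scaling stands in for the group/dynamic-style scaling used by checkpoints like the RedHatAI FP8-dynamic model, while 128x128 block-wise scaling matches what the Qwen FP8 checkpoints appear to use (their config advertises weight_block_size = [128, 128]). The block-wise variant produces a whole grid of scales that the kernel must consume, and that is the part currently unsupported on sm120.

```python
# Illustrative sketch only, assuming PyTorch >= 2.1 (for torch.float8_e4m3fn).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3


def quantize_per_tensor(w: torch.Tensor):
    """One scale for the whole weight matrix (group/dynamic-style scaling)."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale  # a single scalar scale


def quantize_block_128(w: torch.Tensor, block: int = 128):
    """One scale per 128x128 block (block-wise scaling, as in the Qwen FP8 checkpoints)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes divisible shapes"
    scales = torch.empty(rows // block, cols // block)
    w_fp8 = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            blk = w[i:i + block, j:j + block]
            s = blk.abs().max().clamp(min=1e-12) / FP8_MAX
            scales[i // block, j // block] = s
            w_fp8[i:i + block, j:j + block] = (blk / s).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scales  # a grid of scales the FP8 GEMM kernel must consume


if __name__ == "__main__":
    w = torch.randn(256, 256)
    _, s_tensor = quantize_per_tensor(w)
    _, s_block = quantize_block_128(w)
    print("per-tensor scales:", s_tensor.numel())  # 1
    print("block-wise scales:", s_block.numel())   # 4
```

A checkpoint quantized the first way only needs a single (or per-group) scale per weight, which the existing sm120 FP8 kernels can handle; the block-scaled checkpoints require kernel support for the per-block scale grid, which is why they fail to load.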
Contributions are welcomed
- I am willing to help implement this feature.