[Feature] Add support for models quantized with AutoRound #17850
Conversation
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Please kindly review when you are free. Regarding the pre-commit CI, the YAPF checker reformats many files that are unrelated to my PR; what should I do?
Exciting! Are there examples you could add as smoke tests to validate that it works? Possibly to the CPU runner, since there is an ipex quant backend, in addition to the gptq/awq forwarding methods.
Thanks for the review. The unit tests will be added in the upcoming commits.
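For context, a minimal smoke test could look like the sketch below; the model repo and the `auto_round` quantization name are assumptions, and the test actually added in this PR may differ.

```python
# Hypothetical smoke-test sketch; the model repo and the "auto_round"
# quantization value are assumptions, not verified artifacts of this PR.
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("model", ["OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"])  # placeholder repo
def test_auto_round_loads_and_generates(model):
    # Load the quantized checkpoint and run a single greedy generation.
    llm = LLM(model=model, quantization="auto_round", max_model_len=2048)
    out = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
    # The smoke check only asserts that something non-empty was generated.
    assert out and out[0].outputs[0].text.strip()
```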
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
@mgoin The unit test has been added. I believe the test failure is not related to my PR. If it is, please kindly let me know, and I will fix it soon.
Signed-off-by: wenhuach21 <[email protected]>
You have quite a few failing pre-commit tests.
Yes, I just fixed it. I believe it was caused by a recent change in vLLM (#17656) or other recent PRs; it was working fine before that.
Seems reasonable to me otherwise, thanks!
Thanks for the review!
@mgoin Also, the pre-commit CI issue doesn't seem related to my PR, so I assume it's safe to ignore it, right?
No, the pre-commit issue needs to be fixed; it is not failing on main. To run the full CI I have to add the ready label, but I haven't yet because of that issue. I can look at fixing it in a bit if you can't figure it out.
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Got it, thanks for the reply! I've figured out the root cause, and the pre-commit CI check now passes.
Signed-off-by: wenhuach21 <[email protected]>
@mgoin The recent unit test failures don't appear to be related to this PR. Could you please help double-check? If that's the case, would it be possible to ignore them, or could you provide some guidance on how to fix them?
FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model]
FAILED quantization/test_cpu_offload.py::test_cpu_offload_gptq - RuntimeError: Server exited unexpectedly
FAILED quantization/test_cpu_offload.py::test_cpu_offload_awq - RuntimeError: Server exited unexpectedly.
FAILED quantization/test_cpu_offload.py::test_cpu_offload_compressed_tensors - AssertionError: Results for model='nm-testing/llama7b-one-shot-2_4-w4a16-marlin24-t' are not the same
weight-loading-multiple-gpu test
| [2025-05-17T05:04:50Z] =============================== warnings summary ===============================
| [2025-05-17T05:04:50Z] ../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
| [2025-05-17T05:04:50Z] /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
| [2025-05-17T05:04:50Z] ref_error: type[Exception] = jsonschema.RefResolutionError,
| [2025-05-17T05:04:50Z]
| [2025-05-17T05:04:50Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| [2025-05-17T05:04:50Z] ======================== 1 skipped, 1 warning in 3.50s =========================
| [2025-05-17T05:04:51Z] === PASSED MODEL: None, mgleize/fairseq2-dummy-Llama-3.2-1B, main ===
| [2025-05-17T05:04:52Z] 🚨 Error: The command exited with status 1
| [2025-05-17T05:04:52Z] user command error: The plugin docker command hook exited with status 1
Thanks so much, @mgoin, for your kind review and support!
Background
This PR adds support for models quantized with AutoRound (github / paper).
AutoRound delivers significantly higher accuracy at extremely low bit-widths (e.g., 2-bit) and offers broader compatibility across models (LLMs and VLMs), quantization formats, and configurations. You can check out our github/paper or this blog post.
AutoRound has been integrated into both pytorch/ao and Hugging Face Transformers. Several Hugging Face Spaces offer models quantized with AutoRound, including OPEA, Kaitchup, and fbaldassarri.
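For reference, loading an AutoRound-quantized checkpoint in vLLM could look like the sketch below. This is a minimal sketch, assuming this PR registers the quantization method as "auto_round"; the model repo name is a placeholder, not a verified artifact.

```python
# Minimal usage sketch. Assumptions: the "auto_round" quantization name comes
# from this PR, and the model repo below is a placeholder AutoRound checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc",  # hypothetical AutoRound checkpoint
    quantization="auto_round",  # may be optional if vLLM detects it from quantization_config
)
outputs = llm.generate(
    ["Explain weight-only quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```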
Known issues
Mixed-bit support is limited
Mixed-bit quantization is currently limited: since vLLM fuses layers (e.g., QKV), applying different bit-widths to components within the same fused layer can lead to incompatibility issues, as illustrated by the sketch below.
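To make the fusion constraint concrete, here is a purely illustrative (hypothetical) per-layer configuration that cannot be honored once q/k/v are merged; it is not a real AutoRound or vLLM config schema.

```python
# Illustration only: not a real AutoRound or vLLM config schema.
# vLLM merges q_proj/k_proj/v_proj into one fused QKV linear layer, so all three
# must share a single quantized weight layout (bits, group size, symmetry).
hypothetical_layer_bits = {
    "model.layers.0.self_attn.q_proj": 4,
    "model.layers.0.self_attn.k_proj": 2,  # conflicts with q_proj/v_proj after fusion
    "model.layers.0.self_attn.v_proj": 4,
}
# After fusion there is only one QKV weight tensor, so a per-projection
# bit-width like the one above cannot be applied cleanly.
```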
Quantized MoE model support is limited
Qwen3-30B-A3B: KeyError: 'layers.45.mlp.gate.qweight'. The gptq format has the same issue, while awq fails with assert self.quant_method is not None.
deepseek-moe-16b-base: "The input size is not aligned with the quantized weight shape", or MergedColumnParallelLinear object has no attribute 'weight'. The same issues exist for awq and gptq.
Quantized VLM support is limited
The module names may differ from those in Transformers, which introduces a risk of not parsing the quantization config correctly.
OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc: the Marlin kernel has issues; we need to fall back to the gptq kernel.
Qwen2.5-VL-7B: the auto_round:auto_gptq format fails with both the Marlin and gptq kernels; the gptq model has a similar issue. The auto_round:auto_awq and awq formats are fine.