[Feature] Add support for models quantized with AutoRound #17850
Conversation
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Please kindly review when you are free. Regarding the pre-commit CI, the YAPF checker reformats many files that are unrelated to my PR; what should I do?
Exciting! Are there examples you could add as smoke tests to validate that it works? Possibly to the CPU runner, since there is an ipex quant backend, in addition to the gptq/awq forwarding methods.
Thanks for the review. The unit tests will be added in the upcoming commits.
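For context, a minimal smoke test could look like the sketch below; the model repo and the `auto_round` quantization name are assumptions, and the test actually added in this PR may differ.

```python
# Hypothetical smoke-test sketch; the model repo and the "auto_round"
# quantization value are assumptions, not verified artifacts of this PR.
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("model", ["OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"])  # placeholder repo
def test_auto_round_loads_and_generates(model):
    # Load the quantized checkpoint and run a single greedy generation.
    llm = LLM(model=model, quantization="auto_round", max_model_len=2048)
    out = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
    # The smoke check only asserts that something non-empty was generated.
    assert out and out[0].outputs[0].text.strip()
```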
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
@mgoin The unit test has been added. I believe the test failure is not related to my PR. If it is, please kindly let me know, and I will fix it soon.
Signed-off-by: wenhuach21 <[email protected]>
You have quite a few failing pre-commit tests.
Yes, I just fixed it. I believe it was caused by a recent change in vLLM (#17656) or other recent PRs; it was working fine before that.
Seems reasonable to me otherwise, thanks!
Thanks for the review!
@mgoin Also, the pre-commit CI issue doesn't seem related to my PR, so I assume it's safe to ignore it, right?
No, the pre-commit issue needs to be fixed; it is not failing on main. To run the full CI I have to add the ready label, but I haven't yet because of that issue. I can look at fixing it in a bit if you can't figure it out.
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Got it, thanks for the reply! I've figured out the root cause, and the pre-commit CI check now passes.
Signed-off-by: wenhuach21 <[email protected]>
@mgoin The recent unit test failures don't appear to be related to this PR. Could you please help double-check? If that's the case, would it be possible to ignore them, or could you provide some guidance on how to fix them?
FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model]
FAILED quantization/test_cpu_offload.py::test_cpu_offload_gptq - RuntimeError: Server exited unexpectedly
FAILED quantization/test_cpu_offload.py::test_cpu_offload_awq - RuntimeError: Server exited unexpectedly.
FAILED quantization/test_cpu_offload.py::test_cpu_offload_compressed_tensors - AssertionError: Results for model='nm-testing/llama7b-one-shot-2_4-w4a16-marlin24-t' are not the same
weight-loading-multiple-gpu test
| [2025-05-17T05:04:50Z] =============================== warnings summary ===============================
| [2025-05-17T05:04:50Z] ../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
| [2025-05-17T05:04:50Z] /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
| [2025-05-17T05:04:50Z] ref_error: type[Exception] = jsonschema.RefResolutionError,
| [2025-05-17T05:04:50Z]
| [2025-05-17T05:04:50Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
| [2025-05-17T05:04:50Z] ======================== 1 skipped, 1 warning in 3.50s =========================
| [2025-05-17T05:04:51Z] === PASSED MODEL: None, mgleize/fairseq2-dummy-Llama-3.2-1B, main ===
| [2025-05-17T05:04:52Z] 🚨 Error: The command exited with status 1
| [2025-05-17T05:04:52Z] user command error: The plugin docker command hook exited with status 1
Thanks so much, @mgoin, for your kind review and support!
Background
This PR adds support for models quantized with AutoRound (github / paper).
AutoRound delivers significantly higher accuracy at extremely low bit-widths (e.g., 2-bit) and offers broader compatibility across models (LLMs and VLMs), quantization formats, and configurations. You can check out our github/paper or this blog post.
AutoRound has been integrated into both pytorch/ao and Hugging Face Transformers. Several Hugging Face Spaces offer models quantized with AutoRound, including OPEA, Kaitchup, and fbaldassarri.
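For reference, loading an AutoRound-quantized checkpoint in vLLM could look like the sketch below. This is a minimal sketch, assuming this PR registers the quantization method as "auto_round"; the model repo name is a placeholder, not a verified artifact.

```python
# Minimal usage sketch. Assumptions: the "auto_round" quantization name comes
# from this PR, and the model repo below is a placeholder AutoRound checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc",  # hypothetical AutoRound checkpoint
    quantization="auto_round",  # may be optional if vLLM detects it from quantization_config
)
outputs = llm.generate(
    ["Explain weight-only quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```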
Known issues
Mixed-bit support is limited
Mixed-bit quantization is currently limited: since vLLM fuses layers (e.g., QKV), applying different bit-widths to components within the same fused layer can lead to incompatibility issues, as illustrated by the sketch below.
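To make the fusion constraint concrete, here is a purely illustrative (hypothetical) per-layer configuration that cannot be honored once q/k/v are merged; it is not a real AutoRound or vLLM config schema.

```python
# Illustration only: not a real AutoRound or vLLM config schema.
# vLLM merges q_proj/k_proj/v_proj into one fused QKV linear layer, so all three
# must share a single quantized weight layout (bits, group size, symmetry).
hypothetical_layer_bits = {
    "model.layers.0.self_attn.q_proj": 4,
    "model.layers.0.self_attn.k_proj": 2,  # conflicts with q_proj/v_proj after fusion
    "model.layers.0.self_attn.v_proj": 4,
}
# After fusion there is only one QKV weight tensor, so a per-projection
# bit-width like the one above cannot be applied cleanly.
```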
Quantized MoE model support is limited
Qwen3-30B-A3B: KeyError: 'layers.45.mlp.gate.qweight'. The gptq format has the same issue, while awq fails with assert self.quant_method is not None.
deepseek-moe-16b-base: "The input size is not aligned with the quantized weight shape", or MergedColumnParallelLinear object has no attribute 'weight'. The same issues exist for awq and gptq.
Quantized VLM support is limited
The module names may differ from those in Transformers, which introduces a risk of not parsing the quantization config correctly.
OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc: the Marlin kernel has issues; we need to fall back to the gptq kernel.
Qwen2.5-VL-7B: the auto_round:auto_gptq format fails with both the Marlin and gptq kernels; the gptq model has a similar issue. The auto_round:auto_awq and awq formats are fine.