[AWQ] Support accumulation for reduced memory usage #1435
Conversation
These deletions make me happy! LGTM!
Force-pushed from c4cd97c to 2659c22
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs (num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid the huge memory requirements of caching all activations at once.

Regression test results are acceptable: results are all roughly the same and within stderr, see test plan below.

Resolves #1409
Resolves #1369
Related to #1383
Related to #1406
Related to #1368
Related to #1410
More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate

No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)). All gsm8k results are within stderr:

| Type | gsm8k | wikitext |
| ------ | ------ | ----- |
| Old AWQ+QuantModifier Sym | .1054, .1069 | 9.1931 |
| New AWQ+QuantMixin Sym | .1077, .1084 | 9.1841 |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281 |
| New AWQ+QuantMixin Asym | .1312, .1350 | 9.0288 |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
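To illustrate the QuantizationMixin change, below is a minimal sketch of an AWQ recipe where quantization settings come from the mixin-style `scheme`/`targets`/`ignore` fields rather than separate `num_bits`/`symmetric`/`group_size` arguments. The `oneshot` entrypoint, dataset name, and exact argument values are assumptions for illustration, not taken from this PR.

```python
# Minimal sketch (assumed API): configure AWQ with QuantizationMixin-style
# fields instead of separate num_bits/symmetric/group_size arguments.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        # Quantization settings now come from the mixin (scheme/targets/ignore).
        scheme="W4A16",       # 4-bit group-wise weights, 16-bit activations
        targets=["Linear"],   # apply to Linear layers
        ignore=["lm_head"],   # keep the output head in higher precision
    ),
]

oneshot(
    model="meta-llama/Llama-2-7b-hf",
    dataset="open_platypus",          # any small calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```

Because AWQModifier now runs in the sequential pipelines, calibration activations are captured one layer at a time during the oneshot pass rather than all at once.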
Force-pushed from abfd68b to d6ffe8c
Force-pushed from 6acc231 to 00e63c5
Sweet!
Force-pushed from cb13e02 to d27172b
So much cleaner ❤️
Please change this name
Force-pushed from d27172b to c930a0f
Force-pushed from 8b2c142 to 1f194e5
Force-pushed from 1f194e5 to 4bbb71b
Also approved changes with @kylesayrs, co-owner of this PR.
> [!NOTE]
> @brian-dellabetta updated summary here (took over this PR from @kylesayrs)

### Summary

This update removes the `Catcher` logic ported from AutoAWQ, instead using the SequentialPipeline features and a couple of hooks to cache args into module forward passes, as needed to run AWQ. This should cause no significant change in results, but it should be a more accurate implementation because kwargs are cached for each parent layer rather than re-using those of the first module.

Leveraging `IntermediatesCache` for cached values, this also exposes a new `offload_cache` bool value on `AWQModifier`; if set to True, it will offload cached values at the expense of slower runtime. With `meta-llama/Llama-2-7b-hf`, offloading decreases max GPU memory from ~27GB to ~20GB, at the cost of `apply_smoothing` taking ~17 seconds per iteration as opposed to ~5 seconds. Because of this, I am leaving the default to not offload, just noting in the docstring to toggle this if users encounter OOM errors.

### Test Plan

Confirmed that these changes don't significantly alter PPL scores relative to the current implementation on main:

- `meta-llama/Llama-3.2-3B-Instruct`: PPL 14.1523 on main, 14.081 on this branch
- `Qwen/Qwen2.5-7B-Instruct`: PPL 10.411 on main, 10.736 on this branch
- `meta-llama/Llama-2-7b-hf`: PPL 9.5075 on main, 9.503 on this branch

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
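To make the hook-based caching described above concrete, here is an illustrative sketch, not the actual llm-compressor implementation, of a forward pre-hook that records the args/kwargs passed into each parent layer so they can be replayed during smoothing, with optional CPU offloading in the spirit of the `offload_cache` option. The helper name and cache structure are hypothetical.

```python
# Illustrative sketch (hypothetical helper, not llm-compressor's code): cache
# the inputs of each parent layer's forward pass so AWQ can replay them when
# searching for smoothing scales, optionally offloading tensors to CPU.
import torch


def register_input_cache_hook(module: torch.nn.Module, cache: list, offload: bool = False):
    """Record (args, kwargs) of every forward call on `module` into `cache`."""

    def _to_cache(value):
        # Move tensors to CPU when offloading, reducing peak GPU memory at the
        # cost of transfer time when the cached values are replayed.
        if offload and torch.is_tensor(value):
            return value.to("cpu")
        return value

    def _pre_hook(_mod, args, kwargs):
        cache.append(
            (
                tuple(_to_cache(a) for a in args),
                {k: _to_cache(v) for k, v in kwargs.items()},
            )
        )

    # with_kwargs=True (PyTorch >= 2.0) ensures keyword arguments such as
    # attention masks and position ids are captured per parent layer too,
    # rather than re-using those observed on the first module.
    return module.register_forward_pre_hook(_pre_hook, with_kwargs=True)
```

In this sketch, one hook per parent layer would be registered, the calibration batches run once through the sequential pipeline, and the cached (args, kwargs) replayed when computing smoothing scales for that layer's children; whether to offload is the runtime-versus-memory trade-off described in the summary above.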