[AWQ] Support accumulation for reduced memory usage #1435


Merged: 15 commits into main from kylesayrs/awq-accumulation on May 29, 2025

Conversation

@kylesayrs (Collaborator) commented on May 15, 2025

Note

@brian-dellabetta updated the summary here (took over this PR from @kylesayrs)

Summary

This update removes the Catcher logic ported from AutoAWQ, instead using the SequentialPipeline features and a couple of hooks to cache the args passed into module forward calls, as needed to run AWQ. It should cause no significant change in results, but it should be a more accurate implementation because kwargs are cached for each parent layer rather than re-using those of the first module.
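For readers who haven't seen the hook-based approach, here is a minimal, self-contained sketch of per-layer kwargs caching with a PyTorch forward pre-hook. It is illustrative only (the toy model and names are not from this PR), but it shows why each parent layer ends up with its own cached kwargs instead of reusing the first module's.

```python
# Illustrative sketch (not the llm-compressor implementation): cache the kwargs
# each parent layer actually receives during calibration, instead of reusing
# those captured for the first module as the old Catcher logic did.
from collections import defaultdict

import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x, scale=1.0):
        return self.linear(x) * scale


model = nn.ModuleDict({"block0": ToyBlock(), "block1": ToyBlock()})
cached_kwargs = defaultdict(list)  # layer name -> kwargs seen per forward pass


def make_cache_hook(name):
    def hook(module, args, kwargs):
        cached_kwargs[name].append(dict(kwargs))  # snapshot this layer's kwargs
        # returning None leaves args/kwargs untouched

    return hook


handles = [
    block.register_forward_pre_hook(make_cache_hook(name), with_kwargs=True)
    for name, block in model.items()
]

# simulate calibration passes that hit each layer with different kwargs
x = torch.randn(2, 8)
model["block0"](x, scale=0.5)
model["block1"](x, scale=2.0)

print(cached_kwargs["block0"])  # [{'scale': 0.5}]
print(cached_kwargs["block1"])  # [{'scale': 2.0}]

for handle in handles:
    handle.remove()
```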

Leveraging IntermediatesCache for cached values, this also exposes a new offload_cache bool on AWQModifier; if set to True, cached values are offloaded at the expense of slower runtime. With meta-llama/Llama-2-7b-hf, offloading decreases max GPU memory from ~27GB to ~20GB, at the cost of apply_smoothing taking ~17 seconds per iteration as opposed to ~5 seconds. Because of this, I am leaving the default as not offloading, and just noting in the docstring to toggle this if users encounter OOM errors.
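As a usage note, a hedged sketch of turning the flag on is below; the surrounding oneshot arguments and exact import paths are assumptions based on the general llm-compressor API, not copied from this diff.

```python
# Hedged sketch: enabling cached-value offloading on AWQModifier to trade
# runtime for lower peak GPU memory. Import paths / argument names are
# assumptions based on the description above, not copied from this PR.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],
    ignore=["lm_head"],
    scheme="W4A16",        # quantization settings now come via QuantizationMixin
    offload_cache=True,    # offload cached intermediates; slower, but avoids OOM
)

oneshot(
    model="meta-llama/Llama-2-7b-hf",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```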

Test Plan

Confirmed that these changes don't significantly alter PPL scores relative to the current implementation on main (a rough sketch of the perplexity check follows the list below).

  • meta-llama/Llama-3.2-3B-Instruct
    • PPL 14.1523 on main, 14.081 on this branch
  • Qwen/Qwen2.5-7B-Instruct
    • PPL 10.411 on main, 10.736 on this branch
  • meta-llama/Llama-2-7b-hf
    • PPL 9.5075 on main, 9.503 on this branch
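The exact evaluation script isn't part of this description; the following is a rough sketch of a standard chunked perplexity measurement, where the dataset, split, and chunk length are illustrative assumptions rather than the settings behind the numbers above.

```python
# Rough sketch of a chunked perplexity measurement (not the exact script
# behind the numbers above); dataset/split and chunk length are illustrative.
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

chunk_len, nll_sum, n_tokens = 2048, 0.0, 0
for start in range(0, ids.size(1) - 1, chunk_len):
    chunk = ids[:, start : start + chunk_len]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # labels=chunk -> HF shifts internally; loss is mean NLL per predicted token
        loss = model(chunk, labels=chunk).loss
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"PPL: {math.exp(nll_sum / n_tokens):.4f}")
```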

@rahul-tuli (Collaborator) left a comment

These deletions make me happy! LGTM!

@brian-dellabetta force-pushed the bdellabe/awq-quantization-mixin branch from c4cd97c to 2659c22 on May 15, 2025 20:45
Base automatically changed from bdellabe/awq-quantization-mixin to main on May 15, 2025 21:45
brian-dellabetta added a commit that referenced this pull request May 15, 2025
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs (num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid huge memory requirements of caching all activations at once.

Regression test results are acceptable; results are all roughly the same and within stderr. See the test plan below.

Resolves #1409 
Resolves #1369 
Related to #1383
Related to #1406 
Related to #1368 
Related to #1410 

More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate
No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)).
All gsm8k results are within stderr:

| Type | gsm8k | wikitext |
| --- | --- | --- |
| Old AWQ+QuantModifier Sym | .1054, .1069 | 9.1931 |
| New AWQ+QuantMixin Sym | .1077, .1084 | 9.1841 |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281 |
| New AWQ+QuantMixin Asym | .1312, .1350 | 9.0288 |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
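For context, the gsm8k / wikitext comparison quoted above could be reproduced with an lm-evaluation-harness run along these lines; the model path is a hypothetical placeholder, and pairing the two gsm8k figures per row with the strict-match / flexible-extract exact-match scores is an assumption.

```python
# Hedged sketch of an lm-evaluation-harness run for the comparison above.
# The pretrained path is a hypothetical output directory for the compressed model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./Llama-2-7b-hf-awq-w4a16,dtype=auto",
    tasks=["gsm8k", "wikitext"],
    batch_size=8,
)

# gsm8k typically reports exact_match under strict-match and flexible-extract
# filters; wikitext reports word perplexity.
print(results["results"]["gsm8k"])
print(results["results"]["wikitext"])
```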
@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch from abfd68b to d6ffe8c on May 21, 2025 21:55
@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch from 6acc231 to 00e63c5 on May 27, 2025 21:53
@brian-dellabetta changed the title from "[WIP][AWQ] Support accumulation for reduced memory usage" to "[AWQ] Support accumulation for reduced memory usage" on May 27, 2025
@brian-dellabetta marked this pull request as ready for review on May 27, 2025 22:03
@brian-dellabetta added the "ready" label (When a PR is ready for review) on May 27, 2025
@brian-dellabetta (Collaborator) left a comment

approving a PR I contributed to

(screenshot attached, 2025-05-27 5:07 PM)

@kylesayrs (Collaborator, Author) left a comment

Sweet!

@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch from cb13e02 to d27172b on May 28, 2025 22:05
@brian-dellabetta enabled auto-merge (squash) on May 28, 2025 22:08
@rahul-tuli previously approved these changes on May 28, 2025
@rahul-tuli (Collaborator) left a comment

So much cleaner ❤️

@kylesayrs (Collaborator, Author) left a comment

Please change this name

@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch from d27172b to c930a0f on May 29, 2025 13:58
@brian-dellabetta dismissed stale reviews from rahul-tuli and themself via 9b3d310 on May 29, 2025 14:42
@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch 2 times, most recently from 8b2c142 to 1f194e5 on May 29, 2025 14:44
@brian-dellabetta force-pushed the kylesayrs/awq-accumulation branch from 1f194e5 to 4bbb71b on May 29, 2025 15:23
@brian-dellabetta (Collaborator) left a comment

also approved changes with @kylesayrs, co-owner of this PR

@rahul-tuli (Collaborator) left a comment

@brian-dellabetta merged commit 9439f18 into main on May 29, 2025
11 checks passed
@brian-dellabetta deleted the kylesayrs/awq-accumulation branch on May 29, 2025 16:07
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request on Jul 30, 2025 (commit message duplicates the summary quoted above)
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request on Jul 30, 2025 (commit message duplicates the PR description above)