[AWQ] Support accumulation for reduced memory usage #1435
Conversation
These deletions make me happy! LGTM!
Force-pushed from c4cd97c to 2659c22
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs (num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid the huge memory requirements of caching all activations at once.

Regression test results are acceptable: results are all roughly the same and within stderr, see test plan below.

Resolves #1409
Resolves #1369
Related to #1383
Related to #1406
Related to #1368
Related to #1410
More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate

No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)). All gsm8k results are within stderr:

| Type | gsm8k | wikitext |
| ------ | ------ | ----- |
| Old AWQ+QuantModifier Sym | .1054, .1069 | 9.1931 |
| New AWQ+QuantMixin Sym | .1077, .1084 | 9.1841 |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281 |
| New AWQ+QuantMixin Asym | .1312, .1350 | 9.0288 |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
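To illustrate the QuantizationMixin change, below is a minimal sketch of an AWQ recipe where quantization settings come from the mixin-style `scheme`/`targets`/`ignore` fields rather than separate `num_bits`/`symmetric`/`group_size` arguments. The `oneshot` entrypoint, dataset name, and exact argument values are assumptions for illustration, not taken from this PR.

```python
# Minimal sketch (assumed API): configure AWQ with QuantizationMixin-style
# fields instead of separate num_bits/symmetric/group_size arguments.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        # Quantization settings now come from the mixin (scheme/targets/ignore).
        scheme="W4A16",       # 4-bit group-wise weights, 16-bit activations
        targets=["Linear"],   # apply to Linear layers
        ignore=["lm_head"],   # keep the output head in higher precision
    ),
]

oneshot(
    model="meta-llama/Llama-2-7b-hf",
    dataset="open_platypus",          # any small calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```

Because AWQModifier now runs in the sequential pipelines, calibration activations are captured one layer at a time during the oneshot pass rather than all at once.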
Force-pushed from abfd68b to d6ffe8c
Force-pushed from 6acc231 to 00e63c5
Sweet!
Force-pushed from cb13e02 to d27172b
So much cleaner ❤️
Please change this name
Force-pushed from d27172b to c930a0f
Force-pushed from 8b2c142 to 1f194e5
Force-pushed from 1f194e5 to 4bbb71b
Also approved changes with @kylesayrs, co-owner of this PR.
> [!NOTE]
> @brian-dellabetta updated summary here (took over this PR from @kylesayrs)

### Summary

This update removes the `Catcher` logic ported from AutoAWQ, instead using the SequentialPipeline features and a couple of hooks to cache args into module forward passes, as needed to run AWQ. This should cause no significant change in results, but it should be a more accurate implementation because kwargs are cached for each parent layer rather than re-using those of the first module.

Leveraging `IntermediatesCache` for cached values, this also exposes a new `offload_cache` bool value on `AWQModifier`; if set to True, it will offload cached values at the expense of slower runtime. With `meta-llama/Llama-2-7b-hf`, offloading decreases max GPU memory from ~27GB to ~20GB, at the cost of `apply_smoothing` taking ~17 seconds per iteration as opposed to ~5 seconds. Because of this, I am leaving the default to not offload, just noting in the docstring to toggle this if users encounter OOM errors.

### Test Plan

Confirmed that these changes don't significantly alter PPL scores relative to the current implementation on main:

- `meta-llama/Llama-3.2-3B-Instruct`: PPL 14.1523 on main, 14.081 on this branch
- `Qwen/Qwen2.5-7B-Instruct`: PPL 10.411 on main, 10.736 on this branch
- `meta-llama/Llama-2-7b-hf`: PPL 9.5075 on main, 9.503 on this branch

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
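To make the hook-based caching described above concrete, here is an illustrative sketch, not the actual llm-compressor implementation, of a forward pre-hook that records the args/kwargs passed into each parent layer so they can be replayed during smoothing, with optional CPU offloading in the spirit of the `offload_cache` option. The helper name and cache structure are hypothetical.

```python
# Illustrative sketch (hypothetical helper, not llm-compressor's code): cache
# the inputs of each parent layer's forward pass so AWQ can replay them when
# searching for smoothing scales, optionally offloading tensors to CPU.
import torch


def register_input_cache_hook(module: torch.nn.Module, cache: list, offload: bool = False):
    """Record (args, kwargs) of every forward call on `module` into `cache`."""

    def _to_cache(value):
        # Move tensors to CPU when offloading, reducing peak GPU memory at the
        # cost of transfer time when the cached values are replayed.
        if offload and torch.is_tensor(value):
            return value.to("cpu")
        return value

    def _pre_hook(_mod, args, kwargs):
        cache.append(
            (
                tuple(_to_cache(a) for a in args),
                {k: _to_cache(v) for k, v in kwargs.items()},
            )
        )

    # with_kwargs=True (PyTorch >= 2.0) ensures keyword arguments such as
    # attention masks and position ids are captured per parent layer too,
    # rather than re-using those observed on the first module.
    return module.register_forward_pre_hook(_pre_hook, with_kwargs=True)
```

In this sketch, one hook per parent layer would be registered, the calibration batches run once through the sequential pipeline, and the cached (args, kwargs) replayed when computing smoothing scales for that layer's children; whether to offload is the runtime-versus-memory trade-off described in the summary above.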