[BugFix] Fix quantization_2of4_sparse_w4a16 example #1565
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @shanjiaz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a bug in the `quantization_2of4_sparse_w4a16` example script. The core issue was that models were not being correctly decompressed when passed from one optimization stage to the next, leading to potential errors or incorrect behavior. The changes introduce explicit model loading and decompression steps between the sparsity, finetuning, and quantization phases, ensuring a correct workflow for the example.
Highlights
- Corrected Model Flow: Implemented explicit loading of models from the `output_dir` after the sparsity and finetuning stages, ensuring that the correct model state is retrieved for subsequent operations.
- Inter-Stage Decompression: Introduced a critical decompression step for models loaded after the sparsity and finetuning stages, utilizing `get_model_compressor` and `decompress_model` to properly prepare the models for the next optimization phase (see the sketch after this list).
- Output Directory Handling: Ensured the `output_dir` parameter is correctly passed to the `oneshot` function's keyword arguments, facilitating proper saving and loading of intermediate models.
- Streamlined Saving: Removed redundant `save_pretrained` calls for the final quantized model and tokenizer, as the `oneshot` function now implicitly handles saving to the specified `output_dir`.
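To make the inter-stage flow above concrete, here is a minimal sketch assuming the `get_model_compressor` / `decompress_model` helpers named in this PR; the import path, call signatures, and directory names are assumptions for illustration rather than the exact example code.

```python
# Minimal sketch of the reload-and-decompress step between stages.
# The import path, the get_model_compressor signature, and the directory
# names below are assumptions, not verbatim code from the PR.
from transformers import AutoModelForCausalLM

from llmcompressor.pytorch.model_load.helpers import get_model_compressor  # assumed path

output_dir = "output_llama7b_2of4_w4a16_channel"  # hypothetical output directory

# Reload the model that the previous (sparsity) stage saved to disk.
model = AutoModelForCausalLM.from_pretrained(
    f"{output_dir}/sparsity_stage", torch_dtype="auto"
)

# Decompress so the next stage (finetuning, then quantization) sees dense weights.
compressor = get_model_compressor(model=model)  # signature assumed
if compressor is not None:
    compressor.decompress_model(model)
```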
Code Review
The pull request adds a decompression step between processing stages in the `llama7b_sparse_w4a16.py` example. The model is reloaded from disk after the sparsity and finetuning stages and then decompressed. The explicit final save calls were removed, which is correct. Refactoring the repeated model reloading and decompression logic into a helper function would improve code maintainability.
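A minimal sketch of the refactor the review suggests, reusing the helpers named in this PR; the import path, signatures, and directory names are assumptions, not the actual example code.

```python
# Hedged sketch of the suggested refactor: wrap the repeated reload + decompress
# logic in a single helper. Import path and helper signatures are assumptions.
from transformers import AutoModelForCausalLM, PreTrainedModel

from llmcompressor.pytorch.model_load.helpers import get_model_compressor  # assumed path


def reload_and_decompress(stage_dir: str) -> PreTrainedModel:
    """Reload the model a previous stage saved and decompress it for the next stage."""
    model = AutoModelForCausalLM.from_pretrained(stage_dir, torch_dtype="auto")
    compressor = get_model_compressor(model=model)  # signature assumed
    if compressor is not None:
        compressor.decompress_model(model)
    return model


# Hypothetical usage between stages:
model = reload_and_decompress("output_llama7b_2of4_w4a16_channel/sparsity_stage")
# ... run the finetuning stage, then reload its output the same way ...
```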
You need to rebase your PR
It looks like this keeps references to `model`, `oneshot_applied_model`, and `finetune_applied_model`, so essentially three models are living in memory. Same behavior as the original example, but I just wanted to call it out.
Good point! Not sure if there's anything I can do about saving three different models. I'm testing to see if I can skip the manual decompression step at least.
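Not something this PR changes, but one way to address the memory observation above would be to drop the previous stage's reference before loading the next model; a rough sketch, with variable names mirroring the comment rather than the actual example code:

```python
# Rough sketch only: release the previous stage's model so three full models are
# not resident at once. Variable names mirror the comment above, not the example.
import gc

import torch

oneshot_applied_model = None   # drop the sparsity-stage model reference
finetune_applied_model = None  # likewise once the finetuning stage is done
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return freed GPU memory to the allocator
```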
Looks good, just make sure you're using `pathlib` in a way you expect.
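For context on the `pathlib` note, a small illustration of the usual pattern (directory names here are hypothetical):

```python
# Illustration of the pathlib caution above: join paths with the `/` operator
# and convert explicitly when a plain string is required. Names are hypothetical.
from pathlib import Path

output_dir = Path("output_llama7b_2of4_w4a16_channel")
stage_dir = output_dir / "finetuning_stage"  # Path joined with str yields a Path
model_path = str(stage_dir)                  # convert when an API expects a str
print(model_path)
```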
nice
@shanjiaz Good job! Have you tested the accuracy of the model produced by this script? I am sad to find that my 14B model's MMLU score dropped a lot (from 72 to 60) after using this script.
SUMMARY:
Pass the saved model directly to the next stage, since `train` and `oneshot` can now properly initialize models from a path.

TEST PLAN:
Tested `test_quantization_2of4_sparse_w4a16` locally:

```
collected 2 items
tests/examples/test_quantization_2of4_sparse_w4a16.py::TestQuantization24SparseW4A16::test_doc_example_command PASSED
tests/examples/test_quantization_2of4_sparse_w4a16.py::TestQuantization24SparseW4A16::test_alternative_recipe PASSED
=========================================================== 2 passed in 6123.28s (1:42:03) ===========================================================
```

Signed-off-by: shanjiaz <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
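To make the summary concrete, here is a hedged sketch of what passing a saved model path to the next stage looks like; the keyword names, recipe file, and directory layout are assumptions rather than the exact example code.

```python
# Hedged sketch: a later stage is pointed at the directory the previous stage
# saved to, instead of an in-memory model object. Keyword names, the recipe
# file, and the directory layout are assumptions for illustration.
from llmcompressor import oneshot  # assumes the top-level entrypoint

output_dir = "output_llama7b_2of4_w4a16_channel"  # hypothetical

oneshot(
    model=f"{output_dir}/finetuning_stage",  # path written by the previous stage
    recipe="2of4_w4a16_recipe.yaml",         # hypothetical recipe file
    output_dir=output_dir,                   # oneshot saves the result here itself
)
```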