[Tracing] Support tracing of Gemma3 [#1248] #1373
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Signed-off-by: Kelvin Cheng <[email protected]>
…arity Signed-off-by: Kelvin Cheng <[email protected]>
## Purpose ##
* Add better exception messages when encountering tracing errors

## Example ##
* Below is an example of a potential tracing runtime error (this particular error was forced for demonstration purposes)

````
Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 45, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 12, in forward
TypeError: iter(v, w): v must be callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/modifiers/quantization/gptq/base.py", line 234, in on_initialize
    run_sequential(
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/pipeline.py", line 67, in run_pipeline
    subgraph.forward(model, **inputs)
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 47, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

```
 1
 2
 3
 4 def forward(self, input_ids : torch.Tensor, attention_mask : torch.Tensor):
 5     model_rotary_emb_inv_freq = self.model.rotary_emb.inv_freq
 6     getitem_10 = model_rotary_emb_inv_freq[(None, slice(None, None, None), None)];  model_rotary_emb_inv_freq = None
 7     model_embed_tokens = self.model.embed_tokens(input_ids);  input_ids = None
 8     size_3 = attention_mask.size();  size_3 = None
 9     dim = attention_mask.dim()
10     size_6 = attention_mask.size()
11     getitem_8 = attention_mask[(slice(None, None, None), None, None, slice(None, None, None))]
12     iter_6 = iter(attention_mask, 'device');  attention_mask = None
13     float_1 = getitem_10.float();  getitem_10 = None
14     size = model_embed_tokens.size()
15     iter_1 = iter(model_embed_tokens, 'device')
```
````

## Changes ##
* Move the forward call inside the Subgraph class and wrap it in order to catch and propagate exceptions

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kelvin Cheng <[email protected]>
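For readers unfamiliar with the pattern described in the Changes section, a minimal sketch of wrapping a subgraph's forward call to attach the generated code to the exception might look like the following. The `Subgraph` fields and helper names here are illustrative assumptions, not the exact llm-compressor implementation:

```python
import textwrap
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Subgraph:
    """Illustrative stand-in for a traced subgraph: holds the compiled
    forward function and the source code it was generated from."""

    forward_fn: Callable[..., Any]
    code: str

    def forward(self, *args, **kwargs) -> Any:
        try:
            return self.forward_fn(*args, **kwargs)
        except Exception as exc:
            # Re-raise with the generated code attached so the user can see
            # exactly which traced line failed, rather than only an opaque
            # '<string>' frame in the traceback.
            numbered = "\n".join(
                f"{i + 1:>4} {line}" for i, line in enumerate(self.code.splitlines())
            )
            raise RuntimeError(
                "Raised an exception during execution of the following code:\n"
                + textwrap.indent(numbered, "  ")
            ) from exc
```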
Force-pushed from 49e7e84 to 0e322b6.
I've added some relevant improvements to the tracing system. In the meantime, this is good to land. Great job!
Signed-off-by: Kelvin Cheng <[email protected]>
# Summary
- Fix device_map and set torch.dtype for the given model
- Move tests to a folder which makes more sense
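For context, device placement and dtype for a Hugging Face model are usually set at load time. A minimal sketch of the general pattern (the model name here is illustrative, not tied to this commit):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" shards the model across the available devices;
# torch_dtype selects the load precision instead of defaulting to fp32.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```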
…vllm-project#1328)

SUMMARY:
Fixed issue vllm-project#1319 where Recipe.model_dump() output couldn't be used with Recipe.model_validate(). Implemented an override of the model_dump() method to ensure it produces output in the format expected by validation, enabling proper round-trip serialization using standard Pydantic methods.

TEST PLAN:
- Created test cases to verify the fix works with both simple and complex recipes
- Confirmed Recipe.model_validate(recipe.model_dump()) succeeds with various recipe formats
- Validated that recipes with multiple stages having the same group name serialize/deserialize correctly
- Ensured existing YAML serialization pathways continue to work as expected

---------

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
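As background on the round-trip pattern being fixed (this is a generic Pydantic sketch, not the actual Recipe implementation), a model can override model_dump() so its output feeds straight back into model_validate():

```python
from typing import Any, Dict, List

from pydantic import BaseModel


class Stage(BaseModel):
    group: str
    modifiers: Dict[str, Any] = {}


class Recipe(BaseModel):
    stages: List[Stage] = []

    def model_dump(self, **kwargs) -> Dict[str, Any]:
        # Emit the same shape that model_validate() expects, so that
        # Recipe.model_validate(recipe.model_dump()) round-trips cleanly.
        return {"stages": [stage.model_dump(**kwargs) for stage in self.stages]}


recipe = Recipe(stages=[Stage(group="quantization"), Stage(group="quantization")])
assert Recipe.model_validate(recipe.model_dump()) == recipe
```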
…oject#1377)

This PR adds comprehensive documentation for the compression parameters available in the enhanced `save_pretrained` method. These parameters are critical for users working with model compression but were previously undocumented.

## Changes
- Adds a new `docs/save_pretrained.md` file explaining:
  - How the enhanced `save_pretrained` method works
  - Detailed descriptions of all compression parameters
  - Code examples showing common usage patterns
  - Notes on compatibility with loading compressed models

## Benefits
- **Better User Experience:** Users can clearly understand all available options
- **Improved Onboarding:** New users can quickly learn how to save compressed models
- **Comprehensive Examples:** Shows different approaches for saving models with compression

This documentation supports [ticket](https://issues.redhat.com/browse/INFERENG-578) and will help users leverage the full capabilities of the compression functionality in the save process.

---------

Signed-off-by: Rahul Tuli <[email protected]>
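As a rough illustration of the kind of usage the new doc covers (a model compressed with llm-compressor and then saved in compressed form); the recipe, dataset, and any parameter beyond `save_compressed`, as well as the exact `oneshot` import path for your version, should be checked against `docs/save_pretrained.md` rather than taken from this sketch:

```python
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply a quantization recipe first; llm-compressor wraps save_pretrained
# with compression-aware keyword arguments as part of this flow.
oneshot(model=model, recipe="recipe.yaml", dataset="open_platypus")

SAVE_DIR = MODEL_ID.split("/")[-1] + "-compressed"
model.save_pretrained(SAVE_DIR, save_compressed=True)  # store weights in compressed form
tokenizer.save_pretrained(SAVE_DIR)
```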
…oject#1378)

This PR reverts commit 998be99, which was merged prematurely. The required base tests were skipped during the original review process. When these tests eventually ran on the main branch, they revealed a failure: https://github.com/vllm-project/llm-compressor/actions/runs/14628792870/job/41046641641

The original PR vllm-project#1328 has been reopened to address the identified issues before resubmitting.
SUMMARY:
Add a robustness check to AWQ to exclude mappings where the layer shapes don't align. This is a known issue; it is handled in a separate PR because it occurs in a different location in AutoAWQ, and keeping the initial AWQ PR as close as possible to the AutoAWQ implementation lets the git history reflect our changes well.

TEST PLAN:
Added unit tests to check the default case and edge cases.

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
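A minimal sketch of the kind of shape-alignment guard described above; the helper name and mapping structure are hypothetical, not the actual AWQ implementation:

```python
from typing import List

import torch.nn as nn


def mapping_has_aligned_shapes(smooth_layer: nn.Linear, balance_layers: List[nn.Linear]) -> bool:
    """Return True only if every balance layer's input width matches the
    smooth layer's output width, so scales can be applied consistently."""
    return all(layer.in_features == smooth_layer.out_features for layer in balance_layers)


# Hypothetical usage: skip mappings whose shapes don't line up instead of erroring out.
smooth = nn.Linear(8, 8)
balance = [nn.Linear(8, 16), nn.Linear(4, 8)]  # second layer's input width mismatches
if not mapping_has_aligned_shapes(smooth, balance):
    print("Skipping mapping with mismatched layer shapes")
```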
SUMMARY:
Update the `w4a16_actorder_weight.yaml` lmeval config to have the appropriate `scheme` value. This resolves an issue where the timing CSV files had a filename clash, resulting in the later test overwriting the first.

TEST PLAN:
Affected tests have been executed in our internal CI to verify.

Signed-off-by: Domenic Barbuzzi <[email protected]>
SUMMARY:
- Add a test case for asym AWQ
- Add processing steps for `mit-han-lab/pile-val-backup`
- Update an incorrect logger.warning call

Testing:
- Generated a model, pushed it to the Hub successfully, and ran it using vLLM
- Model: https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v1.0-w4a16-asym-awq-e2e/tree/main
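For reference, preprocessing a raw-text calibration set such as `mit-han-lab/pile-val-backup` generally means loading the text split and trimming or sampling it before calibration. A hedged sketch follows; the split name, sample count, and truncation threshold are assumptions, not the exact steps added in this commit:

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 256
MAX_CHARS = 4096

# Load the raw Pile validation dump commonly used for AWQ calibration.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    # Keep only the raw text and truncate very long samples by characters;
    # tokenizer-level truncation happens later in the calibration pipeline.
    return {"text": example["text"][:MAX_CHARS]}


ds = ds.map(preprocess)
```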
SUMMARY:
LlamaAttention.forward has an optional `attention_mask` field that has no default (see [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L246)), so `attention_mask=None` must be passed in, otherwise AWQ will error out. The previous check only worked for Python 3.10 and 3.11. This fixes it with a more general, recommended solution that also works with Python 3.9.

```python
from transformers.models.llama.modeling_llama import LlamaAttention
import inspect
import typing

params = inspect.signature(LlamaAttention.forward).parameters

# old check
old_check = (params["attention_mask"].annotation._name == "Optional")
# new check
new_check = (params["attention_mask"].default is inspect.Parameter.empty)

print(f"OLD {old_check}, NEW {new_check}")
# Python 3.9:  OLD False, NEW True
# Python 3.11: OLD True,  NEW True
```

TEST PLAN:
This will resolve the failing e2e test at https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14654995202/job/41128588916#step:15:33208

---------

Signed-off-by: Brian Dellabetta <[email protected]>
SUMMARY:
This PR resolves the issues surrounding an Optional parameter passed into a torch module's .forward method during AWQ. Previous attempts to resolve this in vllm-project#1384 also added kwargs for parameters that are passed in positionally later on. This change makes the addition to kwargs more strict: a parameter is only added if its annotation indicates that it is an optional field.

This hotfix will still fail if optional fields are passed in positionally, if the typing annotation is `a: int | None` instead of `a: typing.Optional[int]`, or if there is no typehint at all and the field is not provided. It will be addressed with a more general solution soon, see vllm-project#1385.

TEST PLAN:
New test was run with Python 3.9 and passed -- https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14713323963/job/41291028422

Signed-off-by: Brian Dellabetta <[email protected]>
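For context on why the annotation-based check has the limitations listed above, here is a sketch of that style of detection. It is illustrative only, not the exact hotfix code, and the example `forward` signature is made up:

```python
import inspect
import typing


def is_optional_annotation(annotation) -> bool:
    # Treats typing.Optional[X] (i.e. Union[X, None]) as optional. This style of
    # check misses the `X | None` union syntax and parameters with no annotation,
    # which matches the limitations described above.
    return (
        typing.get_origin(annotation) is typing.Union
        and type(None) in typing.get_args(annotation)
    )


def forward(x, attention_mask: typing.Optional[list] = None, scale: int = 1):
    ...


params = inspect.signature(forward).parameters
print(is_optional_annotation(params["attention_mask"].annotation))  # True
print(is_optional_annotation(params["scale"].annotation))           # False
```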
## Purpose ##
* Reduce the size of the llm-compressor package

## Changes ##
* Exclude all image files using `MANIFEST.in`

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kelkelcheng Would you be so kind as to merge in these changes? #1399. This adds a gemma3 example and fixes style warnings.
It looks like there may be some tracing assumptions that are broken, as the example fails.

Line 36 traceback:

```
Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 53, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 36, in forward
TypeError: 'NoneType' object is not subscriptable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
```
Absolutely, and I will try to run the example to see if I can reproduce the same issue.
Signed-off-by: Kyle Sayers <[email protected]>
@kelkelcheng It seems that the bug is related to how an if statement was removed when adding […]. Once the example runs to completion, this should be good to go.
Signed-off-by: Kelvin Cheng <[email protected]>
Signed-off-by: Kelvin Cheng <[email protected]>
Awesome, thanks so much for digging in and testing!
SUMMARY:
Add support for tracing of Gemma3: [issue#1248](#1248).

Steps that I have done:
1. Create gemma3.py from HF and update __init__.py.
2. Classes and functions that I modified:
   - Gemma3ForConditionalGeneration: `_update_causal_mask` and `forward`
   - Gemma3TextModel: `_update_causal_mask`, `forward`, and `_prepare_4d_causal_attention_mask_with_cache_position`

TEST PLAN:
Ran:

```
llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision
```

Output:

![trace_output](https://github.com/user-attachments/assets/8f5c9c7d-32a9-4b12-b4b2-10b6a4352846)

This is my first attempt at solving this issue. It was a fun learning experience, so please review it carefully. Gemma3 can go through tracing now, but we might need further tests for the quantization as well.

---------

Signed-off-by: Kelvin Cheng <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Domenic Barbuzzi <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Vedant <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Domenic Barbuzzi <[email protected]>
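For readers curious what making a model definition traceable typically involves, the usual pattern is to remove or isolate data-dependent Python control flow so that torch.fx can record the forward pass. The generic sketch below is only an illustration of that pattern, not the actual Gemma3 changes in this PR; the module and helper names are made up:

```python
import torch
import torch.fx


def maybe_apply_mask(scores: torch.Tensor, attention_mask) -> torch.Tensor:
    # Data-dependent branch that torch.fx cannot trace through directly.
    if attention_mask is not None:
        scores = scores + attention_mask
    return scores


# Register the helper as a leaf function so tracing records a single call
# to it instead of attempting to trace the `if` statement above.
torch.fx.wrap("maybe_apply_mask")


class TinyAttention(torch.nn.Module):
    def forward(self, scores: torch.Tensor, attention_mask=None) -> torch.Tensor:
        return maybe_apply_mask(scores, attention_mask).softmax(dim=-1)


graph_module = torch.fx.symbolic_trace(TinyAttention())
print(graph_module.code)
```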