[Tracing] Support tracing of Gemma3 [#1248] #1373


Merged
merged 22 commits into vllm-project:main from kelkelcheng:kc/gemma-3-tracing-support on May 3, 2025

Conversation

kelkelcheng
Contributor

SUMMARY:
Add support for tracing of Gemma3: issue #1248.

Steps that I have done:

  1. Create gemma3.py from HF and update __init__.py.
  2. Classes and functions that I modified (the sketch after this list illustrates the general pattern):
     2.1 Gemma3ForConditionalGeneration: _update_causal_mask and forward
     2.2 Gemma3TextModel: _update_causal_mask, forward, and _prepare_4d_causal_attention_mask_with_cache_position
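For context, here is a minimal sketch of the general pattern these traceable definitions follow, assuming torch.fx symbolic tracing as used by `llmcompressor.trace`. The helper below is a simplified stand-in, not the actual code in gemma3.py:

```python
import torch
import torch.fx

# Data-dependent mask construction cannot be traced symbolically, so it is pulled out
# into a helper registered as a leaf call with torch.fx.wrap. The traced graph then
# records a call to build_causal_mask instead of tracing through its control flow.
@torch.fx.wrap
def build_causal_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    batch_size, seq_len = attention_mask.shape
    min_value = torch.finfo(dtype).min
    causal = torch.triu(
        torch.full((seq_len, seq_len), min_value, dtype=dtype), diagonal=1
    )
    causal = causal[None, None, :, :].expand(batch_size, 1, seq_len, seq_len)
    # Also mask out padded key positions from the attention mask
    return causal.masked_fill(attention_mask[:, None, None, :] == 0, min_value)
```

The modified `_update_causal_mask` and forward methods route mask logic that branches on tensor values through wrapped helpers of this kind so that symbolic tracing can proceed.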

TEST PLAN:
Ran:
`llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision`

Output:
trace_output screenshot (the trace completes successfully)

This is my first attempt at solving this issue. It has been a fun learning experience, so please review it carefully.
Gemma3 can now go through tracing, but we may need further tests for quantization as well.


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@dsikka dsikka requested a review from kylesayrs April 23, 2025 17:48
kelkelcheng and others added 3 commits April 27, 2025 02:27
## Purpose ##
* Add better exception messages when encountering tracing errors

## Example ##
* Below is an example of a potential tracing runtime error (this
particular error was forced for demonstration purposes)
````
Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 45, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 12, in forward
TypeError: iter(v, w): v must be callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/modifiers/quantization/gptq/base.py", line 234, in on_initialize
    run_sequential(
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/pipeline.py", line 67, in run_pipeline
    subgraph.forward(model, **inputs)
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 47, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
```
1
2
3
4 def forward(self, input_ids : torch.Tensor, attention_mask : torch.Tensor):
5     model_rotary_emb_inv_freq = self.model.rotary_emb.inv_freq
6     getitem_10 = model_rotary_emb_inv_freq[(None, slice(None, None, None), None)];  model_rotary_emb_inv_freq = None
7     model_embed_tokens = self.model.embed_tokens(input_ids);  input_ids = None
8     size_3 = attention_mask.size();  size_3 = None
9     dim = attention_mask.dim()
10     size_6 = attention_mask.size()
11     getitem_8 = attention_mask[(slice(None, None, None), None, None, slice(None, None, None))]
12     iter_6 = iter(attention_mask, 'device');  attention_mask = None
13     float_1 = getitem_10.float();  getitem_10 = None
14     size = model_embed_tokens.size()
15     iter_1 = iter(model_embed_tokens, 'device')
```
````

## Changes ##
* Move the forward call inside the Subgraph class and wrap it in order to catch and propagate exceptions (a minimal sketch follows below)
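A minimal sketch of this pattern, assuming the Subgraph wrapper visible in the traceback above; the names are illustrative, not the actual helpers.py implementation:

```python
class Subgraph:
    """Holds one compiled sequential-pipeline subgraph and its generated source."""

    def __init__(self, forward_fn, code: str):
        self.forward_fn = forward_fn  # the fx-generated forward
        self.code = code              # its source, kept for error reporting

    def forward(self, *args, **kwargs):
        try:
            return self.forward_fn(*args, **kwargs)
        except Exception as exception:
            # Re-raise with the offending generated code attached, as in the example above
            raise RuntimeError(
                "Raised an exception during execution of the following code:\n\n" + self.code
            ) from exception
```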

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kelvin Cheng <[email protected]>
@kelkelcheng kelkelcheng force-pushed the kc/gemma-3-tracing-support branch from 49e7e84 to 0e322b6 Compare April 27, 2025 06:27
kylesayrs
kylesayrs previously approved these changes Apr 27, 2025
Collaborator

@kylesayrs kylesayrs left a comment


I've added some relevant improvements to the tracing system. In the meantime, this is good to land, great job!

Signed-off-by: Kelvin Cheng <[email protected]>
dsikka and others added 12 commits April 29, 2025 10:57
# Summary
- Fix device_map and set torch.dtype for the given model (a minimal sketch follows below)
- Move tests to a more sensible folder
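A minimal sketch of the behaviour being fixed, with a placeholder model id and assumed values rather than the actual test code:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model with an explicit device placement and dtype so the compression
# tests run on the intended device in the intended precision.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    device_map="auto",                     # assumed: automatic placement across available devices
    torch_dtype=torch.bfloat16,            # assumed: explicit dtype rather than the fp32 default
)
```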
…vllm-project#1328)

SUMMARY:
Fixed issue vllm-project#1319 where Recipe.model_dump() output couldn't be used with
Recipe.model_validate(). Implemented an override of the model_dump()
method to ensure it produces output in the format expected by
validation, enabling proper round-trip serialization using standard
Pydantic methods.
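The round-trip property this fix restores, written as a small check; the import path is assumed from the llmcompressor package layout:

```python
from llmcompressor.recipe import Recipe  # assumed import path

def check_round_trip(recipe: Recipe) -> None:
    # model_dump() must now produce output that model_validate() accepts, so a
    # dumped recipe can be reloaded using standard Pydantic methods.
    reloaded = Recipe.model_validate(recipe.model_dump())
    assert reloaded.model_dump() == recipe.model_dump()
```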

TEST PLAN:
- Created test cases to verify the fix works with both simple and complex recipes
- Confirmed Recipe.model_validate(recipe.model_dump()) succeeds with various recipe formats
- Validated that recipes with multiple stages having the same group name serialize/deserialize correctly
- Ensured existing YAML serialization pathways continue to work as expected

---------

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
…oject#1377)

This PR adds comprehensive documentation for the compression parameters
available in the enhanced `save_pretrained` method. These parameters are
critical for users working with model compression but were previously
undocumented.

## Changes

- Adds a new `docs/save_pretrained.md` file explaining:
  - How the enhanced `save_pretrained` method works
  - Detailed descriptions of all compression parameters
  - Code examples showing common usage patterns
  - Notes on compatibility with loading compressed models

## Benefits

- **Better User Experience:** Users can clearly understand all available
options
- **Improved Onboarding:** New users can quickly learn how to save
compressed models
- **Comprehensive Examples:** Shows different approaches for saving
models with compression

This documentation supports
[ticket](https://issues.redhat.com/browse/INFERENG-578) and will help
users leverage the full capabilities of the compression functionality in
the save process.

---------

Signed-off-by: Rahul Tuli <[email protected]>
…oject#1378)

This PR reverts commit 998be99 which
was merged prematurely. The required base tests were skipped during the
original review process. When these tests eventually ran on the main
branch, they revealed a failure:


https://github.com/vllm-project/llm-compressor/actions/runs/14628792870/job/41046641641

The original PR vllm-project#1328 has been reopened to address the identified issues
before resubmitting.
SUMMARY:
Add a robustness check to AWQ to exclude mappings where the layer shapes
don't align. This is a known issue; I wanted to handle it in a separate PR
because it occurs in a different location in AutoAWQ, and I wanted to keep
the initial AWQ PR as close to the AutoAWQ implementation as possible so the
git history reflects our changes well.
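A generic sketch of the kind of shape check described here, not the actual AWQ mapping code:

```python
import torch

def shapes_align(smooth_layer: torch.nn.Linear, balance_layers: list) -> bool:
    # A per-channel smoothing scale computed over the smooth layer's outputs can only
    # be folded into balance layers whose input dimension matches that channel count;
    # mappings that fail this check are excluded.
    return all(layer.in_features == smooth_layer.out_features for layer in balance_layers)
```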


TEST PLAN:
Added unit test to check default case and edge cases

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
SUMMARY:
Update the `w4a16_actorder_weight.yaml` lmeval config to have the
appropriate `scheme` value. This resolves an issue where the timing CSV
files had a filename clash resulting in the later test overwriting the
first.

TEST PLAN:
Affected tests have been executed in our internal CI to verify.

Signed-off-by: Domenic Barbuzzi <[email protected]>
SUMMARY:
- Add a test case for asym AWQ
- Add processing steps for `mit-han-lab/pile-val-backup` (see the sketch below)
- Update incorrect logger.warning call
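A sketch of the kind of calibration-set preparation described above; the split name, sample count, and column handling are assumptions, not the actual test code:

```python
from datasets import load_dataset

# Build a small raw-text calibration set from the Pile validation backup.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(256))
ds = ds.map(lambda sample: {"text": sample["text"].strip()})
```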

Testing
- Generated model which was pushed to the hub successfully and ran using vLLM
- Model: https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v1.0-w4a16-asym-awq-e2e/tree/main
SUMMARY:
LlamaAttention.forward has an optional `attention_mask` parameter that has
no default (see
[here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L246)).
So `attention_mask=None` must be passed in explicitly, otherwise AWQ will
error out.

The previous check only worked on Python 3.10 and 3.11. This replaces it
with a more general check that also works on Python 3.9:

```python
from transformers.models.llama.modeling_llama import LlamaAttention
import inspect
import typing

params = inspect.signature(LlamaAttention.forward).parameters

#old check
old_check = (params["attention_mask"].annotation._name == "Optional")
#new check
new_check = (params["attention_mask"].default is inspect.Parameter.empty)

print(f"OLD {old_check}, NEW {new_check}")
# Python 3.9: OLD False, NEW True
# Python 3.11: OLD True, NEW True
```


TEST PLAN:
This will resolve the failing e2e test at
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14654995202/job/41128588916#step:15:33208

---------

Signed-off-by: Brian Dellabetta <[email protected]>
SUMMARY:
This PR resolves the issues surrounding an Optional parameter passed into a
torch module's .forward method during AWQ. A previous attempt to resolve this
in vllm-project#1384 also added kwargs for parameters that are passed in
positionally later on. This change makes the addition to kwargs stricter: a
parameter is only added if its annotation indicates that it is an optional
field.

This hotfix will still fail if optional fields are passed in positionally, if
the typing annotation is `a: int | None` instead of `a: typing.Optional[int]`,
or if there is no typehint at all and the field is not provided. That will be
addressed with a more general solution soon, see vllm-project#1385.
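A minimal sketch of the stricter, annotation-based rule described above; not the actual AWQ hook code:

```python
import inspect
import typing
import torch

def optional_forward_params(module: torch.nn.Module) -> set:
    """Names of forward() parameters whose annotation is typing.Optional[...]."""
    names = set()
    for name, param in inspect.signature(module.forward).parameters.items():
        ann = param.annotation
        # Optional[X] is Union[X, None]; only these are eligible to be filled with None.
        if typing.get_origin(ann) is typing.Union and type(None) in typing.get_args(ann):
            names.add(name)
    return names
```

Note that `a: int | None` annotations have origin `types.UnionType` rather than `typing.Union`, which is why that style is not covered by this check, matching the limitation noted above.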


TEST PLAN:
New test was run with python 3.9 and passed --
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14713323963/job/41291028422

Signed-off-by: Brian Dellabetta <[email protected]>
## Purpose ##
* Reduce size of llm-compressor package

## Changes ##
* Exclude all image files using `MANIFEST.in`

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator

@kelkelcheng Would you be so kind as to merge in these changes? #1399. This adds a gemma3 example and fixes style warnings

@kylesayrs
Collaborator

It looks like there may be some tracing assumptions that are broken, as the example fails.

The failure is at line 36 of the traced forward:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 53, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 36, in forward
TypeError: 'NoneType' object is not subscriptable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/modifiers/quantization/gptq/base.py", line 234, in on_initialize
    run_sequential(
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/pipeline.py", line 67, in run_pipeline
    subgraph.forward(model, **inputs)
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 55, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

1
2 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_mask_pad_token_id")
3 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_triu_causal_mask")
4 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_bidirectional_mask")
5 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3__prepare_4d_causal_attention_mask_with_cache_position")
6
7 def forward(self, input_ids : torch.Tensor, pixel_values : torch.Tensor, attention_mask : torch.Tensor, token_type_ids : torch.Tensor):
8     language_model_model_norm_weight = self.language_model.model.norm.weight
9     language_model_model_rotary_emb_local_inv_freq = self.language_model.model.rotary_emb_local.inv_freq
10     language_model_model_rotary_emb_inv_freq = self.language_model.model.rotary_emb.inv_freq
11     language_model_model_embed_tokens_embed_scale = self.language_model.model.embed_tokens.embed_scale
12     language_model_model_embed_tokens_weight = self.language_model.model.embed_tokens.weight
13     getattr_1 = language_model_model_embed_tokens_weight.dtype
14     gemma3config = transformers_models_gemma3_configuration_gemma3_Gemma3Config(...)
15     getitem_13 = language_model_model_rotary_emb_inv_freq[(None, slice(None, None, None), None)];  language_model_model_rotary_emb_inv_freq = None
16     getitem_16 = language_model_model_rotary_emb_local_inv_freq[(None, slice(None, None, None), None)];  language_model_model_rotary_emb_local_inv_freq = None
17     float_10 = language_model_model_norm_weight.float();  language_model_model_norm_weight = None
18     embedding = torch.nn.functional.embedding(input_ids, language_model_model_embed_tokens_weight, padding_idx = 0, max_norm = None, norm_type = 2.0, scale_grad_by_freq = False, sparse = False);  language_model_model_embed_tokens_weight = None
19     eq_2 = input_ids == 262144   
20     vision_tower = self.vision_tower(pixel_values = pixel_values);  pixel_values = None
21     dim = attention_mask.dim()   
22     dim_1 = attention_mask.dim() 
23     size_2 = attention_mask.size()
24     getitem_8 = attention_mask[(slice(None, None, None), None, None, slice(None, None, None))]
25     to = language_model_model_embed_tokens_embed_scale.to(getattr_1);  language_model_model_embed_tokens_embed_scale = getattr_1 = None
26     mask_pad_token_id = llmcompressor_transformers_tracing_gemma3_mask_pad_token_id(None, input_ids, -1, gemma3config);  input_ids = gemma3config = None
27     float_1 = getitem_13.float();  getitem_13 = None
28     float_5 = getitem_16.float();  getitem_16 = None
29     add_5 = 1.0 + float_10;  float_10 = None
30     unsqueeze = eq_2.unsqueeze(-1);  eq_2 = None
31     getattr_3 = vision_tower.last_hidden_state;  vision_tower = None
32     eq_3 = dim == 4;  dim = eq_3 = None
33     eq_4 = dim_1 == 4;  dim_1 = eq_4 = None
34     getitem_6 = size_2[-1];  size_2 = None
35     mul = embedding * to;  embedding = to = None
36     getitem_55 = mask_pad_token_id[(Ellipsis, slice(1, None, None))];  mask_pad_token_id = None

@kelkelcheng
Contributor Author

@kelkelcheng Would you be so kind as to merge in these changes? #1399. This adds a gemma3 example and fixes style warnings

Absolutely, and I will try to run the example to see if I can reproduce the same issue.

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator

@kelkelcheng It seems that the bug is related to how an if statement was removed when adding mask_pad_token_id, specifically that there used to be an `if labels` check before it. With that check removed, the model attempts to mask the labels even if the user doesn't pass any labels. I made that change in #1399.

Once the example runs to completion, this should be good to go.
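A minimal, self-contained sketch of the guard described above; the helper and its names are hypothetical, not the actual gemma3.py diff:

```python
from typing import Optional
import torch

def mask_labels(labels: Optional[torch.Tensor], pad_token_id: int) -> Optional[torch.Tensor]:
    # Only mask when the caller actually provides labels; with the guard removed,
    # the masking path runs on None and fails during traced execution.
    if labels is None:
        return None
    labels = labels.clone()
    labels[labels == pad_token_id] = -100  # ignore index for the loss
    return labels
```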

@kylesayrs
Collaborator

Success!
https://huggingface.co/nm-testing/gemma-3-4b-it-W4A16-G128/tree/main/

@kelkelcheng
Contributor Author

Success! https://huggingface.co/nm-testing/gemma-3-4b-it-W4A16-G128/tree/main/

Awesome—thanks so much for digging in and testing!
I just merged the changes from #1399 and pushed again. Looks like the DCO is acting up, but let me know if there’s anything I need to fix.

@kylesayrs kylesayrs changed the title [WIP][Tracing] Support tracing of Gemma3 [#1248] [Tracing] Support tracing of Gemma3 [#1248] May 2, 2025
@dsikka dsikka added the ready When a PR is ready for review label May 2, 2025
@dsikka dsikka enabled auto-merge (squash) May 2, 2025 16:33
@dsikka dsikka merged commit 4d630df into vllm-project:main May 3, 2025
5 checks passed
kylesayrs added a commit that referenced this pull request May 4, 2025
SUMMARY:
Add support for tracing of Gemma3:
[issue#1248](#1248).

Steps that I have done:
1. Create gemma3.py from HF and update __init__.py.
2. Classes and functions that I modified:
    2.1 Gemma3ForConditionalGeneration: _update_causal_mask and forward
    2.2 Gemma3TextModel: _update_causal_mask, forward, and _prepare_4d_causal_attention_mask_with_cache_position


TEST PLAN:
Ran:
`llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision`

Output:
trace_output screenshot (the trace completes successfully)

This is my first attempt at solving this issue. It has been a fun learning
experience, so please review it carefully.
Gemma3 can now go through tracing, but we may need further tests for
quantization as well.

---------

Signed-off-by: Kelvin Cheng <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Domenic Barbuzzi <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Vedant <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Domenic Barbuzzi <[email protected]>
Labels
ready When a PR is ready for review

7 participants