[Tracing] Support tracing of Gemma3 [#1248] #1373


Merged
merged 22 commits into vllm-project:main from kelkelcheng:kc/gemma-3-tracing-support on May 3, 2025

Conversation

kelkelcheng
Contributor

SUMMARY:
Add support for tracing of Gemma3: issue #1248.

Steps that I have done:

  1. Create gemma3.py from HF and update __init__.py.
  2. Classes and functions that I modified (the sketch after this list illustrates the general pattern):
     2.1 Gemma3ForConditionalGeneration: _update_causal_mask and forward
     2.2 Gemma3TextModel: _update_causal_mask, forward, and _prepare_4d_causal_attention_mask_with_cache_position
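For context, here is a minimal sketch of the general pattern these traceable definitions follow, assuming torch.fx symbolic tracing as used by `llmcompressor.trace`. The helper below is a simplified stand-in, not the actual code in gemma3.py:

```python
import torch
import torch.fx

# Data-dependent mask construction cannot be traced symbolically, so it is pulled out
# into a helper registered as a leaf call with torch.fx.wrap. The traced graph then
# records a call to build_causal_mask instead of tracing through its control flow.
@torch.fx.wrap
def build_causal_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    batch_size, seq_len = attention_mask.shape
    min_value = torch.finfo(dtype).min
    causal = torch.triu(
        torch.full((seq_len, seq_len), min_value, dtype=dtype), diagonal=1
    )
    causal = causal[None, None, :, :].expand(batch_size, 1, seq_len, seq_len)
    # Also mask out padded key positions from the attention mask
    return causal.masked_fill(attention_mask[:, None, None, :] == 0, min_value)
```

The modified `_update_causal_mask` and forward methods route mask logic that branches on tensor values through wrapped helpers of this kind so that symbolic tracing can proceed.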

TEST PLAN:
Ran:
`llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision`

Output:
trace_output screenshot (the trace completes successfully)

This is my first attempt at solving this issue. It has been a fun learning experience, so please review it carefully.
Gemma3 can now go through tracing, but we may need further tests for quantization as well.


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@dsikka dsikka requested a review from kylesayrs April 23, 2025 17:48
kelkelcheng and others added 3 commits April 27, 2025 02:27
## Purpose ##
* Add better exception messages when encountering tracing errors

## Example ##
* Below is an example of a potential tracing runtime error (this
particular error was forced for demonstration purposes)
````
Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 45, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 12, in forward
TypeError: iter(v, w): v must be callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/modifiers/quantization/gptq/base.py", line 234, in on_initialize
    run_sequential(
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/pipeline.py", line 67, in run_pipeline
    subgraph.forward(model, **inputs)
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 47, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
```
1
2
3
4 def forward(self, input_ids : torch.Tensor, attention_mask : torch.Tensor):
5     model_rotary_emb_inv_freq = self.model.rotary_emb.inv_freq
6     getitem_10 = model_rotary_emb_inv_freq[(None, slice(None, None, None), None)];  model_rotary_emb_inv_freq = None
7     model_embed_tokens = self.model.embed_tokens(input_ids);  input_ids = None
8     size_3 = attention_mask.size();  size_3 = None
9     dim = attention_mask.dim()
10     size_6 = attention_mask.size()
11     getitem_8 = attention_mask[(slice(None, None, None), None, None, slice(None, None, None))]
12     iter_6 = iter(attention_mask, 'device');  attention_mask = None
13     float_1 = getitem_10.float();  getitem_10 = None
14     size = model_embed_tokens.size()
15     iter_1 = iter(model_embed_tokens, 'device')
```
````

## Changes ##
* Move the forward call inside the Subgraph class and wrap it in order to catch and propagate exceptions (a minimal sketch follows below)
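A minimal sketch of this pattern, assuming the Subgraph wrapper visible in the traceback above; the names are illustrative, not the actual helpers.py implementation:

```python
class Subgraph:
    """Holds one compiled sequential-pipeline subgraph and its generated source."""

    def __init__(self, forward_fn, code: str):
        self.forward_fn = forward_fn  # the fx-generated forward
        self.code = code              # its source, kept for error reporting

    def forward(self, *args, **kwargs):
        try:
            return self.forward_fn(*args, **kwargs)
        except Exception as exception:
            # Re-raise with the offending generated code attached, as in the example above
            raise RuntimeError(
                "Raised an exception during execution of the following code:\n\n" + self.code
            ) from exception
```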

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kelvin Cheng <[email protected]>
@kelkelcheng kelkelcheng force-pushed the kc/gemma-3-tracing-support branch from 49e7e84 to 0e322b6 Compare April 27, 2025 06:27
kylesayrs
kylesayrs previously approved these changes Apr 27, 2025
Collaborator

@kylesayrs kylesayrs left a comment


I've added some relevant improvements to the tracing system. In the meantime, this is good to land, great job!

Signed-off-by: Kelvin Cheng <[email protected]>
dsikka and others added 12 commits April 29, 2025 10:57
# Summary
- Fix device_map and set torch.dtype for the given model (a minimal sketch follows below)
- Move tests to a more sensible folder
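A minimal sketch of the behaviour being fixed, with a placeholder model id and assumed values rather than the actual test code:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model with an explicit device placement and dtype so the compression
# tests run on the intended device in the intended precision.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    device_map="auto",                     # assumed: automatic placement across available devices
    torch_dtype=torch.bfloat16,            # assumed: explicit dtype rather than the fp32 default
)
```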
…vllm-project#1328)

SUMMARY:
Fixed issue vllm-project#1319 where Recipe.model_dump() output couldn't be used with
Recipe.model_validate(). Implemented an override of the model_dump()
method to ensure it produces output in the format expected by
validation, enabling proper round-trip serialization using standard
Pydantic methods.
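The round-trip property this fix restores, written as a small check; the import path is assumed from the llmcompressor package layout:

```python
from llmcompressor.recipe import Recipe  # assumed import path

def check_round_trip(recipe: Recipe) -> None:
    # model_dump() must now produce output that model_validate() accepts, so a
    # dumped recipe can be reloaded using standard Pydantic methods.
    reloaded = Recipe.model_validate(recipe.model_dump())
    assert reloaded.model_dump() == recipe.model_dump()
```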

TEST PLAN:
- Created test cases to verify the fix works with both simple and complex recipes
- Confirmed Recipe.model_validate(recipe.model_dump()) succeeds with various recipe formats
- Validated that recipes with multiple stages having the same group name serialize/deserialize correctly
- Ensured existing YAML serialization pathways continue to work as expected

---------

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
…oject#1377)

This PR adds comprehensive documentation for the compression parameters
available in the enhanced `save_pretrained` method. These parameters are
critical for users working with model compression but were previously
undocumented.

## Changes

- Adds a new `docs/save_pretrained.md` file explaining:
  - How the enhanced `save_pretrained` method works
  - Detailed descriptions of all compression parameters
  - Code examples showing common usage patterns
  - Notes on compatibility with loading compressed models

## Benefits

- **Better User Experience:** Users can clearly understand all available
options
- **Improved Onboarding:** New users can quickly learn how to save
compressed models
- **Comprehensive Examples:** Shows different approaches for saving
models with compression

This documentation supports
[ticket](https://issues.redhat.com/browse/INFERENG-578) and will help
users leverage the full capabilities of the compression functionality in
the save process.

---------

Signed-off-by: Rahul Tuli <[email protected]>
…oject#1378)

This PR reverts commit 998be99 which
was merged prematurely. The required base tests were skipped during the
original review process. When these tests eventually ran on the main
branch, they revealed a failure:


https://github.com/vllm-project/llm-compressor/actions/runs/14628792870/job/41046641641

The original PR vllm-project#1328 has been reopened to address the identified issues
before resubmitting.
SUMMARY:
Add a robustness check to AWQ to exclude mappings where the layer shapes
don't align. This is a known issue; I wanted to handle it in a separate PR
because it occurs in a different location in AutoAWQ, and I wanted to keep
the initial AWQ PR as close to the AutoAWQ implementation as possible so the
git history reflects our changes well.
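A generic sketch of the kind of shape check described here, not the actual AWQ mapping code:

```python
import torch

def shapes_align(smooth_layer: torch.nn.Linear, balance_layers: list) -> bool:
    # A per-channel smoothing scale computed over the smooth layer's outputs can only
    # be folded into balance layers whose input dimension matches that channel count;
    # mappings that fail this check are excluded.
    return all(layer.in_features == smooth_layer.out_features for layer in balance_layers)
```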


TEST PLAN:
Added unit test to check default case and edge cases

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
SUMMARY:
Update the `w4a16_actorder_weight.yaml` lmeval config to have the
appropriate `scheme` value. This resolves an issue where the timing CSV
files had a filename clash resulting in the later test overwriting the
first.

TEST PLAN:
Affected tests have been executed in our internal CI to verify.

Signed-off-by: Domenic Barbuzzi <[email protected]>
SUMMARY:
- Add a test case for asym AWQ
- Add processing steps for `mit-han-lab/pile-val-backup` (see the sketch below)
- Update incorrect logger.warning call
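A sketch of the kind of calibration-set preparation described above; the split name, sample count, and column handling are assumptions, not the actual test code:

```python
from datasets import load_dataset

# Build a small raw-text calibration set from the Pile validation backup.
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
ds = ds.shuffle(seed=42).select(range(256))
ds = ds.map(lambda sample: {"text": sample["text"].strip()})
```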

Testing
- Generated model which was pushed to the hub successfully and ran using vLLM
- Model: https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v1.0-w4a16-asym-awq-e2e/tree/main
SUMMARY:
LlamaAttention.forward has an optional `attention_mask` parameter that has
no default (see
[here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L246)).
So `attention_mask=None` must be passed in explicitly, otherwise AWQ will
error out.

The previous check only worked on Python 3.10 and 3.11. This replaces it
with a more general check that also works on Python 3.9:

```python
from transformers.models.llama.modeling_llama import LlamaAttention
import inspect
import typing

params = inspect.signature(LlamaAttention.forward).parameters

#old check
old_check = (params["attention_mask"].annotation._name == "Optional")
#new check
new_check = (params["attention_mask"].default is inspect.Parameter.empty)

print(f"OLD {old_check}, NEW {new_check}")
# Python 3.9: OLD False, NEW True
# Python 3.11: OLD True, NEW True
```


TEST PLAN:
This will resolve the failing e2e test at
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14654995202/job/41128588916#step:15:33208

---------

Signed-off-by: Brian Dellabetta <[email protected]>
SUMMARY:
This PR resolves the issues surrounding an Optional parameter passed into a
torch module's .forward method during AWQ. A previous attempt to resolve this
in vllm-project#1384 also added kwargs for parameters that are passed in
positionally later on. This change makes the addition to kwargs stricter: a
parameter is only added if its annotation indicates that it is an optional
field.

This hotfix will still fail if optional fields are passed in positionally, if
the typing annotation is `a: int | None` instead of `a: typing.Optional[int]`,
or if there is no typehint at all and the field is not provided. That will be
addressed with a more general solution soon, see vllm-project#1385.
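A minimal sketch of the stricter, annotation-based rule described above; not the actual AWQ hook code:

```python
import inspect
import typing
import torch

def optional_forward_params(module: torch.nn.Module) -> set:
    """Names of forward() parameters whose annotation is typing.Optional[...]."""
    names = set()
    for name, param in inspect.signature(module.forward).parameters.items():
        ann = param.annotation
        # Optional[X] is Union[X, None]; only these are eligible to be filled with None.
        if typing.get_origin(ann) is typing.Union and type(None) in typing.get_args(ann):
            names.add(name)
    return names
```

Note that `a: int | None` annotations have origin `types.UnionType` rather than `typing.Union`, which is why that style is not covered by this check, matching the limitation noted above.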


TEST PLAN:
New test was run with python 3.9 and passed --
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14713323963/job/41291028422

Signed-off-by: Brian Dellabetta <[email protected]>
## Purpose ##
* Reduce size of llm-compressor package

## Changes ##
* Exclude all image files using `MANIFEST.in`

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator

@kelkelcheng Would you be so kind as to merge in these changes? #1399. This adds a gemma3 example and fixes style warnings

@kylesayrs
Collaborator

It looks like there may be some tracing assumptions that are broken, as the example fails.

The failure is at line 36 of the traced forward:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 53, in forward
    outputs = forward_fn(*args, **kwargs)
  File "<string>", line 36, in forward
TypeError: 'NoneType' object is not subscriptable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kyle/llm-compressor/src/llmcompressor/modifiers/quantization/gptq/base.py", line 234, in on_initialize
    run_sequential(
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/pipeline.py", line 67, in run_pipeline
    subgraph.forward(model, **inputs)
  File "/home/kyle/llm-compressor/src/llmcompressor/pipelines/sequential/helpers.py", line 55, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

1
2 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_mask_pad_token_id")
3 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_triu_causal_mask")
4 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3_bidirectional_mask")
5 torch.fx._symbolic_trace.wrap("llmcompressor_transformers_tracing_gemma3__prepare_4d_causal_attention_mask_with_cache_position")
6
7 def forward(self, input_ids : torch.Tensor, pixel_values : torch.Tensor, attention_mask : torch.Tensor, token_type_ids : torch.Tensor):
8     language_model_model_norm_weight = self.language_model.model.norm.weight
9     language_model_model_rotary_emb_local_inv_freq = self.language_model.model.rotary_emb_local.inv_freq
10     language_model_model_rotary_emb_inv_freq = self.language_model.model.rotary_emb.inv_freq
11     language_model_model_embed_tokens_embed_scale = self.language_model.model.embed_tokens.embed_scale
12     language_model_model_embed_tokens_weight = self.language_model.model.embed_tokens.weight
13     getattr_1 = language_model_model_embed_tokens_weight.dtype
14     gemma3config = transformers_models_gemma3_configuration_gemma3_Gemma3Config(...)
15     getitem_13 = language_model_model_rotary_emb_inv_freq[(None, slice(None, None, None), None)];  language_model_model_rotary_emb_inv_freq = None
16     getitem_16 = language_model_model_rotary_emb_local_inv_freq[(None, slice(None, None, None), None)];  language_model_model_rotary_emb_local_inv_freq = None
17     float_10 = language_model_model_norm_weight.float();  language_model_model_norm_weight = None
18     embedding = torch.nn.functional.embedding(input_ids, language_model_model_embed_tokens_weight, padding_idx = 0, max_norm = None, norm_type = 2.0, scale_grad_by_freq = False, sparse = False);  language_model_model_embed_tokens_weight = None
19     eq_2 = input_ids == 262144   
20     vision_tower = self.vision_tower(pixel_values = pixel_values);  pixel_values = None
21     dim = attention_mask.dim()   
22     dim_1 = attention_mask.dim() 
23     size_2 = attention_mask.size()
24     getitem_8 = attention_mask[(slice(None, None, None), None, None, slice(None, None, None))]
25     to = language_model_model_embed_tokens_embed_scale.to(getattr_1);  language_model_model_embed_tokens_embed_scale = getattr_1 = None
26     mask_pad_token_id = llmcompressor_transformers_tracing_gemma3_mask_pad_token_id(None, input_ids, -1, gemma3config);  input_ids = gemma3config = None
27     float_1 = getitem_13.float();  getitem_13 = None
28     float_5 = getitem_16.float();  getitem_16 = None
29     add_5 = 1.0 + float_10;  float_10 = None
30     unsqueeze = eq_2.unsqueeze(-1);  eq_2 = None
31     getattr_3 = vision_tower.last_hidden_state;  vision_tower = None
32     eq_3 = dim == 4;  dim = eq_3 = None
33     eq_4 = dim_1 == 4;  dim_1 = eq_4 = None
34     getitem_6 = size_2[-1];  size_2 = None
35     mul = embedding * to;  embedding = to = None
36     getitem_55 = mask_pad_token_id[(Ellipsis, slice(1, None, None))];  mask_pad_token_id = None

@kelkelcheng
Contributor Author

@kelkelcheng Would you be so kind as to merge in these changes? #1399. This adds a gemma3 example and fixes style warnings

Absolutely, and I will try to run the example to see if I can reproduce the same issue.

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator

@kelkelcheng It seems that the bug is related to how an if statement was removed when adding mask_pad_token_id, specifically that there used to be an `if labels` check before it. With that check removed, the model attempts to mask the labels even if the user doesn't pass any labels. I made that change in #1399.

Once the example runs to completion, this should be good to go.
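A minimal, self-contained sketch of the guard described above; the helper and its names are hypothetical, not the actual gemma3.py diff:

```python
from typing import Optional
import torch

def mask_labels(labels: Optional[torch.Tensor], pad_token_id: int) -> Optional[torch.Tensor]:
    # Only mask when the caller actually provides labels; with the guard removed,
    # the masking path runs on None and fails during traced execution.
    if labels is None:
        return None
    labels = labels.clone()
    labels[labels == pad_token_id] = -100  # ignore index for the loss
    return labels
```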

@kylesayrs
Collaborator

Success!
https://huggingface.co/nm-testing/gemma-3-4b-it-W4A16-G128/tree/main/

@kelkelcheng
Contributor Author

Success! https://huggingface.co/nm-testing/gemma-3-4b-it-W4A16-G128/tree/main/

Awesome—thanks so much for digging in and testing!
I just merged the changes from #1399 and pushed again. Looks like the DCO is acting up, but let me know if there’s anything I need to fix.

@kylesayrs kylesayrs changed the title [WIP][Tracing] Support tracing of Gemma3 [#1248] [Tracing] Support tracing of Gemma3 [#1248] May 2, 2025
@dsikka dsikka added the ready When a PR is ready for review label May 2, 2025
@dsikka dsikka enabled auto-merge (squash) May 2, 2025 16:33
@dsikka dsikka merged commit 4d630df into vllm-project:main May 3, 2025
5 checks passed
kylesayrs added a commit that referenced this pull request May 4, 2025
SUMMARY:
Add support for tracing of Gemma3:
[issue#1248](#1248).

Steps that I have done:
1. Create gemma3.py from HF and update __init__.py.
2. Classes and functions that I modified:
    2.1 Gemma3ForConditionalGeneration: _update_causal_mask and forward
    2.2 Gemma3TextModel: _update_causal_mask, forward, and _prepare_4d_causal_attention_mask_with_cache_position


TEST PLAN:
Ran:
`llmcompressor.trace --model_id google/gemma-3-4b-it --model_class TraceableGemma3ForConditionalGeneration --ignore "lm_head" "re:vision_tower.*" --modality vision`

Output:
trace_output screenshot (the trace completes successfully)

This is my first attempt at solving this issue. It has been a fun learning
experience, so please review it carefully.
Gemma3 can now go through tracing, but we may need further tests for
quantization as well.

---------

Signed-off-by: Kelvin Cheng <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: Domenic Barbuzzi <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Vedant <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Domenic Barbuzzi <[email protected]>
Labels
ready When a PR is ready for review

7 participants