[NVFP4] Fix onloading of fused layers #1512
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite, so please only add the label once the PR is code complete and local testing has been performed.
Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. Based on the title and the changes, this PR focuses on refining the memory handling and global scale calculations specifically for NVFP4 quantization. The core changes involve refactoring how global scales are computed within the calibration process, introducing a new utility to ensure fused layers (like attention and MLP blocks) share a unified global scale as required by certain backends (like vLLM), and updating the estimated memory requirements calculation. Additionally, sample generation code has been added to two Llama3 examples, likely to facilitate testing the impact of these quantization changes.
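For reference, the kind of post-quantization sanity check added to those examples typically looks like the snippet below. This is a minimal sketch using the standard Hugging Face `generate` API; the exact prompt and token counts in the PR may differ, and it assumes `model` and `tokenizer` are already defined earlier in the example script.

```python
# Quick generation check after oneshot quantization (illustrative values).
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("===========================================")
```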
Highlights
- Global Scale Calculation Refactor: The `call_observer` function in `calibration.py` has been modified to allow separate calculation of global scales and quantization parameters (scale and zero point), providing more granular control over the calibration process.
- Fused Layer Global Scale Unification: A new utility function, `update_fused_layer_weight_global_scales`, has been introduced. It iterates through attention and MLP modules and, for TENSOR_GROUP quantization, enforces a single global scale (the minimum of the individual scales) across fused layers (q/k/v and gate/up); see the sketch after this list. This addresses a specific requirement of backends like vLLM when using NVFP4.
- Updated Memory Estimation: The `quantization_memory_requirement` helper function has been updated. The assumed group size for the calculation has changed from 128 to 16, and the multiplier for total elements has changed from 4 to 5, potentially accounting for activation memory or other overheads.
- Example Script Updates: Sample text generation code has been added to the `llama3_example.py` scripts for both the w4a16_fp4 and w4a4_fp4 quantization examples. This allows for quick verification of model output after quantization.
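To make the fused-layer unification concrete, here is a minimal sketch of the idea: take the minimum of the per-projection global scales and write it back to every projection in the fused group. It assumes each projection exposes a `weight_global_scale` parameter; the actual `update_fused_layer_weight_global_scales` in this PR additionally handles module discovery, TENSOR_GROUP checks, and offloaded weights.

```python
import torch

def unify_global_scales(projections: list[torch.nn.Module]) -> None:
    """Illustrative only: give fused submodules (e.g. q/k/v or gate/up
    projections) a shared weight global scale, taken as the minimum of
    their individual scales."""
    scales = [proj.weight_global_scale.data for proj in projections]
    shared = torch.min(torch.stack(scales))
    for proj in projections:
        proj.weight_global_scale.data.copy_(shared)

# e.g. unify_global_scales([attn.q_proj, attn.k_proj, attn.v_proj])
```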
Changelog
- examples/quantization_w4a16_fp4/llama3_example.py
  - Added sample text generation code after the `oneshot` quantization step (lines 22-27).
- examples/quantization_w4a4_fp4/llama3_example.py
  - Added sample text generation code after the `oneshot` quantization step (lines 70-75).
- src/llmcompressor/modifiers/quantization/calibration.py
  - Exported the new `update_weight_global_scale` function (line 32).
  - Modified the `call_observer` function signature to include `should_calculate_gparam` and `should_calculate_qparams` flags (lines 70-76).
  - Refactored the logic within `call_observer` to conditionally calculate and update the global scale and qparams based on the new flags (lines 99-113).
  - Added the new `update_weight_global_scale` function to specifically calculate and update the weight global scale for TENSOR_GROUP quantization (lines 116-132).
  - Updated `calibrate_activations` to pass the appropriate calculation flags (`calculate_gparam`, `calculate_qparams`) to `call_observer` (lines 184-189).
- src/llmcompressor/modifiers/quantization/quantization/base.py
  - Imported `update_weight_global_scale` and `update_fused_layer_weight_global_scales` (lines 6-7, 10).
  - Added a loop in `on_start` to call `update_weight_global_scale` for all modules before the main weight calibration loop (lines 76-77).
  - Added a call to `update_fused_layer_weight_global_scales` within the main weight calibration loop (line 80).
- src/llmcompressor/modifiers/utils/__init__.py
  - Imported the new `helpers` module (line 4).
- src/llmcompressor/modifiers/utils/helpers.py
  - Added a new file containing the `update_fused_layer_weight_global_scales` function, which unifies global scales for fused attention and MLP layers under TENSOR_GROUP quantization (lines 11-107).
- src/llmcompressor/observers/helpers.py
  - Removed the `generate_gparam` function and its import, as it has been moved (lines 1-11, 20-54).
- src/llmcompressor/observers/min_max.py
  - Updated the import source for the `generate_gparam` function to `compressed_tensors.quantization.utils` (line 5).
- src/llmcompressor/transformers/compression/helpers.py
  - Updated the memory requirement calculation, changing the assumed group size from 128 to 16 and the multiplier from 4 to 5 (lines 160, 165); see the arithmetic sketch after this changelog.
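To illustrate the effect of those two constants, the arithmetic below compares the estimated number of per-group quantization parameters before and after the change for one hypothetical Linear weight. This is illustration only, not the actual `quantization_memory_requirement` implementation; the shape and the interpretation of the multiplier follow the changelog entry above.

```python
# Illustrative arithmetic only (not the helper itself).
num_weight_elements = 4096 * 4096            # one hypothetical Linear weight

groups_at_128 = num_weight_elements // 128   # old assumed group size
groups_at_16 = num_weight_elements // 16     # new assumed group size (8x more groups)

# Per the changelog, the helper also scales this count by an overhead
# multiplier, which changed from 4 to 5 in this PR.
old_param_estimate = groups_at_128 * 4
new_param_estimate = groups_at_16 * 5

print(groups_at_128, groups_at_16)            # 131072 1048576
print(old_param_estimate, new_param_estimate) # 524288 5242880
```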
Memory fixed, scales align,
NVFP4 runs faster now,
Code review time.
Code Review
The pull request introduces changes to address NVFP4 memory considerations, primarily by refactoring calibration logic and adding support for the fused-layer global scale updates required by vLLM. It also includes sample generation in the example scripts, which is helpful for demonstration. The refactoring of the `call_observer` function and the introduction of `update_weight_global_scale` seem logically sound for separating global scale calculation. However, there are a few areas for improvement regarding efficiency, clarity, and correctness in the helper functions.
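For readers unfamiliar with that separation, the flow can be pictured roughly as below. This is a hedged sketch, not the actual `call_observer` from the PR (which also manages observers, offloaded parameters, and activation calibration); `compute_global_scale`, `compute_qparams`, and `update_parameter` are hypothetical helpers standing in for the real calibration utilities.

```python
def call_observer_sketch(
    module,
    base_name: str = "weight",
    should_calculate_gparam: bool = False,
    should_calculate_qparams: bool = True,
):
    """Illustrative flow only: compute the global scale and/or the scale and
    zero point independently, depending on the flags the caller passes."""
    value = getattr(module, base_name)
    if should_calculate_gparam:
        global_scale = compute_global_scale(value)      # hypothetical helper
        update_parameter(module, f"{base_name}_global_scale", global_scale)
    if should_calculate_qparams:
        scale, zero_point = compute_qparams(value)      # hypothetical helper
        update_parameter(module, f"{base_name}_scale", scale)
        update_parameter(module, f"{base_name}_zero_point", zero_point)
```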
Summary of Findings
- Efficiency: Multiple loops over modules: The `on_start` method in `QuantizationModifier` iterates over all modules twice (once for global scales, once for zero points and scales). These could potentially be combined into a single loop for better efficiency and maintainability; see the sketch after this list. (Medium severity, commented)
- Clarity: tqdm description: The tqdm description for the second loop in `on_start` is slightly misleading, as the first loop also processes weights. (Medium severity, commented)
- Maintainability/Correctness: Hardcoded group size: The `quantization_memory_requirement` function hardcodes the group size to 16, which might not match the actual recipe and could lead to inaccurate memory estimates. (Medium severity, commented)
- Correctness: Unexplained magic number in memory calculation: The use of the magic number `5` in the `quantization_memory_requirement` function is unclear and lacks explanation, potentially leading to incorrect memory estimations. (High severity, commented)
- Documentation: Docstring parameter name: The docstring for `update_fused_layer_weight_global_scales` incorrectly refers to a `model` parameter instead of `submodule`. (Low severity, not commented due to settings)
- Code Style: Unnecessary `del` calls: The `del global_scale` calls in `update_fused_layer_weight_global_scales` are unnecessary in Python and do not guarantee immediate memory release for PyTorch tensors. (Low severity, not commented due to settings)
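The single-loop suggestion from the first finding could look roughly like the sketch below. This illustrates the reviewer's idea rather than code from the PR: `update_weight_global_scale` is the helper named in the changelog, while `update_weight_zp_scale` stands in for whatever helper the second loop uses to compute scales and zero points (treat that name as an assumption), and any ordering constraint with the fused-layer unification step would still need to be respected.

```python
from tqdm import tqdm

# One pass over the model instead of two separate loops in on_start.
for module in tqdm(list(model.modules()), desc="Calibrating weights"):
    update_weight_global_scale(module)  # global scale for TENSOR_GROUP schemes
    update_weight_zp_scale(module)      # scale / zero point from the weight observer
```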
Merge Readiness
The pull request introduces necessary changes for NVFP4 support, including refactoring calibration logic and adding fused layer scale updates. However, there is a high-severity issue regarding an unclear magic number in the memory calculation helper, which should be addressed for correctness. There are also medium-severity issues related to potential efficiency improvements and clarity. I recommend addressing the high-severity issue and considering the medium-severity points before merging. I am unable to approve this pull request; please have other reviewers approve this code before merging.
Summary
- Properly onload qkv and gate/up layers when updating global scales with CPU offloading (sketched below).

Testing
- Tested in a memory-constrained case to ensure proper behaviour.
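The onloading fix can be pictured roughly as follows: when weights are CPU-offloaded, each fused projection's weight must be moved to the execution device before its global scale is computed or rewritten. The context manager below is a hypothetical stand-in for the onloading utility actually used in the PR; a real offloading framework (e.g. accelerate hooks) would need its own onload/offload mechanism, and `unify_global_scales` refers to the earlier illustrative sketch.

```python
import contextlib
import torch

@contextlib.contextmanager
def onloaded(module: torch.nn.Module, device: str = "cuda"):
    """Hypothetical helper: temporarily move a module's weight to the
    execution device, then restore it to its original device afterwards."""
    original_device = module.weight.device
    module.weight.data = module.weight.data.to(device)
    try:
        yield module
    finally:
        module.weight.data = module.weight.data.to(original_device)

# Usage idea, mirroring the fix described in the summary: onload the fused
# projections before unifying their global scales.
# with onloaded(attn.q_proj), onloaded(attn.k_proj), onloaded(attn.v_proj):
#     unify_global_scales([attn.q_proj, attn.k_proj, attn.v_proj])
```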