[NVFP4] Fix onloading of fused layers #1512
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite, so please only add the label once the PR is code complete and local testing has been performed.
Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. Based on the title and the changes, this PR focuses on refining the memory handling and global scale calculations specifically for NVFP4 quantization. The core changes involve refactoring how global scales are computed within the calibration process, introducing a new utility to ensure fused layers (like attention and MLP blocks) share a unified global scale as required by certain backends (like vLLM), and updating the estimated memory requirements calculation. Additionally, sample generation code has been added to two Llama3 examples, likely to facilitate testing the impact of these quantization changes.
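For reference, the kind of post-quantization sanity check added to those examples typically looks like the snippet below. This is a minimal sketch using the standard Hugging Face `generate` API; the exact prompt and token counts in the PR may differ, and it assumes `model` and `tokenizer` are already defined earlier in the example script.

```python
# Quick generation check after oneshot quantization (illustrative values).
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("===========================================")
```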
Highlights
- Global Scale Calculation Refactor: The `call_observer` function in `calibration.py` has been modified to allow separate calculation of global scales and quantization parameters (scale and zero point), providing more granular control over the calibration process.
- Fused Layer Global Scale Unification: A new utility function, `update_fused_layer_weight_global_scales`, has been introduced. It iterates through attention and MLP modules and, for TENSOR_GROUP quantization, enforces a single global scale (the minimum of the individual scales) across fused layers (q/k/v and gate/up); see the sketch after this list. This addresses a specific requirement of backends like vLLM when using NVFP4.
- Updated Memory Estimation: The `quantization_memory_requirement` helper function has been updated. The assumed group size for the calculation has changed from 128 to 16, and the multiplier for total elements has changed from 4 to 5, potentially accounting for activation memory or other overheads.
- Example Script Updates: Sample text generation code has been added to the `llama3_example.py` scripts for both the w4a16_fp4 and w4a4_fp4 quantization examples. This allows for quick verification of model output after quantization.
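To make the fused-layer unification concrete, here is a minimal sketch of the idea: take the minimum of the per-projection global scales and write it back to every projection in the fused group. It assumes each projection exposes a `weight_global_scale` parameter; the actual `update_fused_layer_weight_global_scales` in this PR additionally handles module discovery, TENSOR_GROUP checks, and offloaded weights.

```python
import torch

def unify_global_scales(projections: list[torch.nn.Module]) -> None:
    """Illustrative only: give fused submodules (e.g. q/k/v or gate/up
    projections) a shared weight global scale, taken as the minimum of
    their individual scales."""
    scales = [proj.weight_global_scale.data for proj in projections]
    shared = torch.min(torch.stack(scales))
    for proj in projections:
        proj.weight_global_scale.data.copy_(shared)

# e.g. unify_global_scales([attn.q_proj, attn.k_proj, attn.v_proj])
```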
Changelog
- examples/quantization_w4a16_fp4/llama3_example.py
  - Added sample text generation code after the `oneshot` quantization step (lines 22-27).
- examples/quantization_w4a4_fp4/llama3_example.py
  - Added sample text generation code after the `oneshot` quantization step (lines 70-75).
- src/llmcompressor/modifiers/quantization/calibration.py
  - Exported the new `update_weight_global_scale` function (line 32).
  - Modified the `call_observer` function signature to include `should_calculate_gparam` and `should_calculate_qparams` flags (lines 70-76).
  - Refactored the logic within `call_observer` to conditionally calculate and update the global scale and qparams based on the new flags (lines 99-113).
  - Added the new `update_weight_global_scale` function to specifically calculate and update the weight global scale for TENSOR_GROUP quantization (lines 116-132).
  - Updated `calibrate_activations` to pass the appropriate calculation flags (`calculate_gparam`, `calculate_qparams`) to `call_observer` (lines 184-189).
- src/llmcompressor/modifiers/quantization/quantization/base.py
  - Imported `update_weight_global_scale` and `update_fused_layer_weight_global_scales` (lines 6-7, 10).
  - Added a loop in `on_start` to call `update_weight_global_scale` for all modules before the main weight calibration loop (lines 76-77).
  - Added a call to `update_fused_layer_weight_global_scales` within the main weight calibration loop (line 80).
- src/llmcompressor/modifiers/utils/__init__.py
  - Imported the new `helpers` module (line 4).
- src/llmcompressor/modifiers/utils/helpers.py
  - Added a new file containing the `update_fused_layer_weight_global_scales` function, which unifies global scales for fused attention and MLP layers under TENSOR_GROUP quantization (lines 11-107).
- src/llmcompressor/observers/helpers.py
  - Removed the `generate_gparam` function and its import, as it has been moved (lines 1-11, 20-54).
- src/llmcompressor/observers/min_max.py
  - Updated the import source for the `generate_gparam` function to `compressed_tensors.quantization.utils` (line 5).
- src/llmcompressor/transformers/compression/helpers.py
  - Updated the memory requirement calculation, changing the assumed group size from 128 to 16 and the multiplier from 4 to 5 (lines 160, 165); see the arithmetic sketch after this changelog.
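To illustrate the effect of those two constants, the arithmetic below compares the estimated number of per-group quantization parameters before and after the change for one hypothetical Linear weight. This is illustration only, not the actual `quantization_memory_requirement` implementation; the shape and the interpretation of the multiplier follow the changelog entry above.

```python
# Illustrative arithmetic only (not the helper itself).
num_weight_elements = 4096 * 4096            # one hypothetical Linear weight

groups_at_128 = num_weight_elements // 128   # old assumed group size
groups_at_16 = num_weight_elements // 16     # new assumed group size (8x more groups)

# Per the changelog, the helper also scales this count by an overhead
# multiplier, which changed from 4 to 5 in this PR.
old_param_estimate = groups_at_128 * 4
new_param_estimate = groups_at_16 * 5

print(groups_at_128, groups_at_16)            # 131072 1048576
print(old_param_estimate, new_param_estimate) # 524288 5242880
```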
Memory fixed, scales align,
NVFP4 runs faster now,
Code review time.
Code Review
The pull request introduces changes to address NVFP4 memory considerations, primarily by refactoring calibration logic and adding support for the fused-layer global scale updates required by vLLM. It also includes sample generation in the example scripts, which is helpful for demonstration. The refactoring of the `call_observer` function and the introduction of `update_weight_global_scale` seem logically sound for separating global scale calculation. However, there are a few areas for improvement regarding efficiency, clarity, and correctness in the helper functions.
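For readers unfamiliar with that separation, the flow can be pictured roughly as below. This is a hedged sketch, not the actual `call_observer` from the PR (which also manages observers, offloaded parameters, and activation calibration); `compute_global_scale`, `compute_qparams`, and `update_parameter` are hypothetical helpers standing in for the real calibration utilities.

```python
def call_observer_sketch(
    module,
    base_name: str = "weight",
    should_calculate_gparam: bool = False,
    should_calculate_qparams: bool = True,
):
    """Illustrative flow only: compute the global scale and/or the scale and
    zero point independently, depending on the flags the caller passes."""
    value = getattr(module, base_name)
    if should_calculate_gparam:
        global_scale = compute_global_scale(value)      # hypothetical helper
        update_parameter(module, f"{base_name}_global_scale", global_scale)
    if should_calculate_qparams:
        scale, zero_point = compute_qparams(value)      # hypothetical helper
        update_parameter(module, f"{base_name}_scale", scale)
        update_parameter(module, f"{base_name}_zero_point", zero_point)
```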
Summary of Findings
- Efficiency: Multiple loops over modules: The `on_start` method in `QuantizationModifier` iterates over all modules twice (once for global scales, once for zero points and scales). These could potentially be combined into a single loop for better efficiency and maintainability; see the sketch after this list. (Medium severity, commented)
- Clarity: tqdm description: The tqdm description for the second loop in `on_start` is slightly misleading, as the first loop also processes weights. (Medium severity, commented)
- Maintainability/Correctness: Hardcoded group size: The `quantization_memory_requirement` function hardcodes the group size to 16, which might not match the actual recipe and could lead to inaccurate memory estimates. (Medium severity, commented)
- Correctness: Unexplained magic number in memory calculation: The use of the magic number `5` in the `quantization_memory_requirement` function is unclear and lacks explanation, potentially leading to incorrect memory estimations. (High severity, commented)
- Documentation: Docstring parameter name: The docstring for `update_fused_layer_weight_global_scales` incorrectly refers to a `model` parameter instead of `submodule`. (Low severity, not commented due to settings)
- Code Style: Unnecessary `del` calls: The `del global_scale` calls in `update_fused_layer_weight_global_scales` are unnecessary in Python and do not guarantee immediate memory release for PyTorch tensors. (Low severity, not commented due to settings)
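The single-loop suggestion from the first finding could look roughly like the sketch below. This illustrates the reviewer's idea rather than code from the PR: `update_weight_global_scale` is the helper named in the changelog, while `update_weight_zp_scale` stands in for whatever helper the second loop uses to compute scales and zero points (treat that name as an assumption), and any ordering constraint with the fused-layer unification step would still need to be respected.

```python
from tqdm import tqdm

# One pass over the model instead of two separate loops in on_start.
for module in tqdm(list(model.modules()), desc="Calibrating weights"):
    update_weight_global_scale(module)  # global scale for TENSOR_GROUP schemes
    update_weight_zp_scale(module)      # scale / zero point from the weight observer
```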
Merge Readiness
The pull request introduces necessary changes for NVFP4 support, including refactoring calibration logic and adding fused layer scale updates. However, there is a high-severity issue regarding an unclear magic number in the memory calculation helper, which should be addressed for correctness. There are also medium-severity issues related to potential efficiency improvements and clarity. I recommend addressing the high-severity issue and considering the medium-severity points before merging. I am unable to approve this pull request; please have other reviewers approve this code before merging.
Summary
- Properly onload qkv and gate/up layers when updating global scales with CPU offloading (sketched below).

Testing
- Tested in a memory-constrained case to ensure proper behaviour.
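The onloading fix can be pictured roughly as follows: when weights are CPU-offloaded, each fused projection's weight must be moved to the execution device before its global scale is computed or rewritten. The context manager below is a hypothetical stand-in for the onloading utility actually used in the PR; a real offloading framework (e.g. accelerate hooks) would need its own onload/offload mechanism, and `unify_global_scales` refers to the earlier illustrative sketch.

```python
import contextlib
import torch

@contextlib.contextmanager
def onloaded(module: torch.nn.Module, device: str = "cuda"):
    """Hypothetical helper: temporarily move a module's weight to the
    execution device, then restore it to its original device afterwards."""
    original_device = module.weight.device
    module.weight.data = module.weight.data.to(device)
    try:
        yield module
    finally:
        module.weight.data = module.weight.data.to(original_device)

# Usage idea, mirroring the fix described in the summary: onload the fused
# projections before unifying their global scales.
# with onloaded(attn.q_proj), onloaded(attn.k_proj), onloaded(attn.v_proj):
#     unify_global_scales([attn.q_proj, attn.k_proj, attn.v_proj])
```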