
[NVFP4] Fix onloading of fused layers #1512


Merged · 7 commits merged into main · Jun 4, 2025
Conversation

@dsikka (Collaborator) commented Jun 4, 2025

Summary

  • Properly onload qkv and gate/up layers when updating global scales with CPU offloading (a hedged sketch of this pattern follows)
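For context, a minimal sketch of the onloading pattern being fixed, assuming plain PyTorch and a placeholder scale formula (the actual PR relies on llm-compressor's offloading utilities, and the compute details differ):

```python
import torch


def update_global_scale_with_onload(module: torch.nn.Module, device: str = "cuda") -> None:
    """Hypothetical illustration: with CPU offloading, the weight lives on the CPU,
    so it must be moved to the execution device before the global scale is computed,
    then the result is stored back alongside the offloaded parameter."""
    original_device = module.weight.device
    weight = module.weight.to(device)            # onload the (possibly CPU-offloaded) weight
    global_scale = weight.abs().amax() / 448.0   # placeholder FP8-style scale, not the real formula
    module.register_buffer("weight_global_scale", global_scale.to(original_device))
    del weight                                   # drop the onloaded copy
```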

Testing:

  • Tested in a memory-constrained case to ensure proper behaviour


github-actions bot commented Jun 4, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please add the label only once the PR is code complete and local testing has been performed.

@gemini-code-assist bot (Contributor) left a comment

Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. Based on the title and the changes, this PR focuses on refining the memory handling and global scale calculations specifically for NVFP4 quantization. The core changes involve refactoring how global scales are computed within the calibration process, introducing a new utility to ensure fused layers (like attention and MLP blocks) share a unified global scale as required by certain backends (like vLLM), and updating the estimated memory requirements calculation. Additionally, sample generation code has been added to two Llama3 examples, likely to facilitate testing the impact of these quantization changes.

Highlights

  • Global Scale Calculation Refactor: The call_observer function in calibration.py has been modified to allow for separate calculation of global scales and quantization parameters (scale and zero point), providing more granular control over the calibration process.
  • Fused Layer Global Scale Unification: A new utility function update_fused_layer_weight_global_scales has been introduced. This function iterates through attention and MLP modules and, for TENSOR_GROUP quantization, enforces a single global scale (the minimum of the individual scales) across fused layers (q/k/v and gate/up). This addresses a specific requirement for backends like vLLM when using NVFP4 (see the sketch after this list).
  • Updated Memory Estimation: The quantization_memory_requirement helper function has been updated. The assumed group size for calculation has changed from 128 to 16, and the multiplier for total elements has changed from 4 to 5, potentially accounting for activation memory or other overheads.
  • Example Script Updates: Sample text generation code has been added to the llama3_example.py scripts for both w4a16_fp4 and w4a4_fp4 quantization examples. This allows for quick verification of model output after quantization.
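As a rough illustration of the fused-layer unification described above (a sketch under assumptions, not the PR's update_fused_layer_weight_global_scales; the q_proj/k_proj/v_proj attribute names and the weight_global_scale buffer name are assumed here):

```python
import torch


def unify_fused_global_scales(attn_module: torch.nn.Module) -> None:
    """Hypothetical sketch: projections that a backend such as vLLM fuses into a single
    GEMM (q/k/v here) must share one weight global scale, so take the minimum of them."""
    fused = [attn_module.q_proj, attn_module.k_proj, attn_module.v_proj]  # assumed attribute names
    scales = torch.stack([m.weight_global_scale.detach().reshape(-1)[0] for m in fused])
    unified = scales.min()                        # the smallest scale covers all fused weights
    for m in fused:
        m.weight_global_scale.data.copy_(unified)  # overwrite each layer's scale in place
```

The same minimum rule would apply to the MLP gate/up projections before their fused GEMM.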

Changelog

  • examples/quantization_w4a16_fp4/llama3_example.py
    • Added sample text generation code after the oneshot quantization step (lines 22-27).
  • examples/quantization_w4a4_fp4/llama3_example.py
    • Added sample text generation code after the oneshot quantization step (lines 70-75).
  • src/llmcompressor/modifiers/quantization/calibration.py
    • Exported the new update_weight_global_scale function (line 32).
    • Modified the call_observer function signature to include should_calculate_gparam and should_calculate_qparams flags (lines 70-76).
    • Refactored the logic within call_observer to conditionally calculate and update global scale and qparams based on the new flags (lines 99-113).
    • Added the new update_weight_global_scale function to specifically calculate and update the weight global scale for TENSOR_GROUP quantization (lines 116-132).
    • Updated calibrate_activations to pass appropriate calculation flags (calculate_gparam, calculate_qparams) to call_observer (lines 184-189).
  • src/llmcompressor/modifiers/quantization/quantization/base.py
    • Imported update_weight_global_scale and update_fused_layer_weight_global_scales (lines 6-7, 10).
    • Added a loop in on_start to call update_weight_global_scale for all modules before the main weight calibration loop (lines 76-77).
    • Added a call to update_fused_layer_weight_global_scales within the main weight calibration loop (line 80); a pseudo-sketch of this ordering follows the changelog.
  • src/llmcompressor/modifiers/utils/__init__.py
    • Imported the new helpers module (line 4).
  • src/llmcompressor/modifiers/utils/helpers.py
    • Added a new file containing the update_fused_layer_weight_global_scales function, which unifies global scales for fused attention and MLP layers under TENSOR_GROUP quantization (lines 11-107).
  • src/llmcompressor/observers/helpers.py
    • Removed the generate_gparam function and its import, as it has been moved (lines 1-11, 20-54).
  • src/llmcompressor/observers/min_max.py
    • Updated the import source for the generate_gparam function to compressed_tensors.quantization.utils (line 5).
  • src/llmcompressor/transformers/compression/helpers.py
    • Updated the memory requirement calculation, changing the assumed group size from 128 to 16 and the multiplier from 4 to 5 (lines 160, 165).
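Putting the changelog items together, here is a hedged pseudo-sketch of the new ordering in on_start. Import paths are inferred from the changelog entries above, update_weight_zp_scale is an assumed name for the existing per-layer scale/zero-point step, and the actual loop structure in base.py differs:

```python
from tqdm import tqdm

# Import paths inferred from the changelog; treat them as assumptions.
from llmcompressor.modifiers.quantization.calibration import (
    update_weight_global_scale,
    update_weight_zp_scale,  # assumed helper name for the existing qparams step
)
from llmcompressor.modifiers.utils.helpers import update_fused_layer_weight_global_scales


def start_weight_calibration(model):
    # Pass 1: give every TENSOR_GROUP-quantized module its own weight global scale first.
    for module in model.modules():
        update_weight_global_scale(module)

    # Pass 2: now that every global scale exists, fused q/k/v and gate/up scales can be
    # unified, and the per-layer scale/zero-point calibration can run against them.
    for module in tqdm(model.modules(), desc="Calibrating weights"):
        update_fused_layer_weight_global_scales(module)
        update_weight_zp_scale(module)
```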


Memory fixed, scales align,
NVFP4 runs faster now,
Code review time.


@gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request introduces changes to address NVFP4 memory considerations, primarily by refactoring calibration logic and adding support for fused layer global scale updates required by vLLM. It also includes sample generation in the example scripts, which is helpful for demonstration. The refactoring of the call_observer function and the introduction of update_weight_global_scale seem logically sound for separating global scale calculation. However, there are a few areas for improvement regarding efficiency, clarity, and correctness in helper functions.

Summary of Findings

  • Efficiency: Multiple loops over modules: The on_start method in QuantizationModifier iterates over all modules twice (once for global scale, once for zp/scale). These could potentially be combined into a single loop for better efficiency and maintainability. (Medium severity, commented)
  • Clarity: Tqdm description: The tqdm description for the second loop in on_start is slightly misleading as the first loop also processes weights. (Medium severity, commented)
  • Maintainability/Correctness: Hardcoded group size: The quantization_memory_requirement function hardcodes the group size to 16, which might not match the actual recipe and could lead to inaccurate memory estimates. (Medium severity, commented)
  • Correctness: Unexplained magic number in memory calculation: The use of the magic number 5 in the quantization_memory_requirement function is unclear and lacks explanation, potentially leading to incorrect memory estimations. (High severity, commented) A rough reconstruction of the arithmetic follows this list.
  • Documentation: Docstring parameter name: The docstring for update_fused_layer_weight_global_scales incorrectly refers to a model parameter instead of submodule. (Low severity, not commented due to settings)
  • Code Style: Unnecessary del calls: The del global_scale calls in update_fused_layer_weight_global_scales are unnecessary in Python and do not guarantee immediate memory release for PyTorch tensors. (Low severity, not commented due to settings)
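To make the group-size and multiplier findings concrete, a rough reconstruction of the arithmetic, not the helper's exact formula; the 5x multiplier is reproduced from the diff without further justification:

```python
def rough_memory_estimate(total_weight_elements: int) -> tuple[int, int]:
    """Back-of-the-envelope sketch of the updated estimate in
    quantization_memory_requirement (a hypothetical simplification)."""
    group_size = 16                                           # hardcoded, as flagged above
    scale_elements = -(-total_weight_elements // group_size)  # ceil division: one scale per group of 16
    estimated_bytes = 5 * total_weight_elements               # multiplier bumped from 4 to 5 in the diff
    return scale_elements, estimated_bytes


# Example: a 7B-parameter model now budgets 8x more scale elements than with the old
# group size of 128, on top of the larger per-element multiplier.
scales, bytes_needed = rough_memory_estimate(7_000_000_000)
```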

Merge Readiness

The pull request introduces necessary changes for NVFP4 support, including refactoring calibration logic and adding fused layer scale updates. However, there is a high-severity issue regarding an unclear magic number in the memory calculation helper, which should be addressed for correctness. There are also medium-severity issues related to potential efficiency improvements and clarity. I recommend addressing the high-severity issue and considering the medium-severity points before merging. I am unable to approve this pull request; please have other reviewers approve this code before merging.

@dsikka force-pushed the fix_nvfp4_memory branch from a70a366 to 59a590b on June 4, 2025 19:55
@dsikka changed the title from "[nvfp4] Fix nvfp4 memory" to "[NVFP4] Fix onloading of fused layers" on Jun 4, 2025
@dsikka added the ready label (When a PR is ready for review) on Jun 4, 2025
@dsikka marked this pull request as ready for review on June 4, 2025 21:02
@dsikka enabled auto-merge (squash) on June 4, 2025 21:25
@dsikka merged commit e7c6ef4 into main on Jun 4, 2025
13 checks passed
@dsikka deleted the fix_nvfp4_memory branch on June 4, 2025 21:47
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025