Auto-Infer mappings Argument for SmoothQuantModifier Based on Model Architecture #119


Merged
mgoin merged 9 commits from smoothquant-mappings-ux into main on Oct 4, 2024

Conversation

rahul-tuli
Collaborator

Description:

This PR introduces a feature that automatically infers the mappings argument for the SmoothQuantModifier based on the model architecture, eliminating the need for manual specification of layer mappings.

Before:

In the prior implementation, users had to manually define layer mappings, as shown below:

quantization_stage:
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate"], "re:.*post_attention_layernorm"]
      ]
      ignore: ["lm_head"]

Now:

With this update, the SmoothQuantModifier automatically infers the mappings based on the architecture, simplifying the configuration:

quantization_stage:
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      ignore: ["lm_head"]
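
For reference, the same recipe can also be applied from Python. The following is a minimal sketch rather than code from this PR: it assumes the usual oneshot entry point, and the calibration dataset name and sample counts are illustrative.

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# mappings are intentionally omitted; the modifier infers them
# from the model architecture
recipe = SmoothQuantModifier(smoothing_strength=0.5, ignore=["lm_head"])

oneshot(
    model="Isotonic/TinyMixtral-4x248M-MoE",  # model used to test this PR
    dataset="open_platypus",                  # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=64,
)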

Key Changes:

  • Auto-inference of mappings: The SmoothQuantModifier now automatically detects and applies appropriate layer mappings based on the model's architecture, making the modifier more user-friendly and reducing the risk of manual configuration errors (a rough sketch of one way such inference can work follows this list).
  • Optional mappings parameter: The mappings parameter is no longer required in the configuration, as it is inferred dynamically.
  • Backward compatibility: Existing configurations that manually specify mappings are still supported, ensuring a smooth transition for older setups.
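
To give a sense of how architecture-based inference can work, here is a minimal registry-plus-fallback sketch. All names below are illustrative, not the symbols actually introduced by this PR:

# Hypothetical sketch; names are illustrative, not this PR's actual symbols.

# Llama-style defaults used when no architecture-specific entry exists
_DEFAULT_MAPPINGS = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]

# per-architecture overrides, keyed by the Hugging Face architecture name
_MAPPINGS_REGISTRY = {
    "MixtralForCausalLM": [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate"], "re:.*post_attention_layernorm"],
    ],
}


def infer_mappings(model) -> list:
    """Resolve SmoothQuant mappings from the model's architecture,
    falling back to the Llama-style defaults."""
    architectures = getattr(model.config, "architectures", None) or []
    for arch in architectures:
        if arch in _MAPPINGS_REGISTRY:
            return _MAPPINGS_REGISTRY[arch]
    return _DEFAULT_MAPPINGS

Keeping the inference logic in a standalone function like this, as suggested later in review, makes it easy to unit-test without constructing a full model.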

Motivation:

These changes improve usability by automating configuration setup and reducing user overhead, as outlined in the design document (Link to Design Doc). They also ensure that quantization recipes adapt to various model architectures without manual intervention.

The auto-inference of mappings was tested using a Mixtral model: Isotonic/TinyMixtral-4x248M-MoE

@dsikka dsikka marked this pull request as ready for review September 3, 2024 22:18
@rahul-tuli rahul-tuli self-assigned this Sep 4, 2024
kylesayrs previously approved these changes Sep 8, 2024
Add more models, Mistral and Qwen2
@kylesayrs
Collaborator

Should consider adding a sentence like "mappings will normally be automatically inferred, but here's how to create your own custom ones" to https://github.com/vllm-project/llm-compressor/pull/115/files

kylesayrs previously approved these changes Oct 3, 2024
Collaborator

@kylesayrs left a comment


Good stuff. As mentioned, including this in the documentation will ensure that this feature actually gets used by users.

Point users to readme
make mappings inference a static function to make it easily testable
@mgoin mgoin merged commit 7c2ab3a into main Oct 4, 2024
6 of 7 checks passed
@mgoin mgoin deleted the smoothquant-mappings-ux branch October 4, 2024 22:54
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024