feat: Implement Two-Sided Clipping for GRPO Trainer #3434

Merged
18 commits merged into huggingface:main on May 13, 2025

Conversation

@ucalyptus (Contributor) commented May 12, 2025

As seen in Prime Intellect's new Intellect-2 Technical Report.

What does this PR do?

This PR introduces a two-sided clipping mechanism to the GRPO (Group Relative Policy Optimization) trainer. This change aims to address a potential stability issue in the standard GRPO formulation where, for negative advantages ($\hat{A}_{i,t} < 0$), the original clipping only applied when the probability ratio was too small. This could lead to extremely large updates if the ratio became very large.

The core modification implements the following objective, in which the unclipped probability ratio $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})$ is additionally capped at $\delta$:

$$
J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(\min\big(r_{i,t}(\theta),\,\delta\big)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right]
$$

A new hyperparameter, delta, has been added to GRPOConfig. This parameter caps the probability ratio for negative advantages. It is recommended to set $\delta > 1 + \epsilon$ to allow significant updates while preventing extreme changes that could destabilize training.
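For reference, here is a minimal sketch of what the two-sided clipping boils down to in the loss computation. It is only an illustration of the mechanism; tensor and argument names (per_token_logps, advantages, epsilon_low, etc.) are illustrative rather than the exact variables used in grpo_trainer.py:

```python
import torch

def two_sided_grpo_loss(per_token_logps, old_per_token_logps, advantages,
                        epsilon_low=0.2, epsilon_high=0.2, delta=None):
    # Probability ratio r = pi_theta / pi_theta_old, computed in log space,
    # per token: shape (batch, seq_len).
    coef_1 = torch.exp(per_token_logps - old_per_token_logps)
    # Standard GRPO clipped ratio.
    coef_2 = torch.clamp(coef_1, 1 - epsilon_low, 1 + epsilon_high)

    if delta is not None:
        # Two-sided clipping: cap the unclipped ratio at delta so that tokens
        # with negative advantage cannot produce arbitrarily large updates.
        coef_1 = torch.clamp(coef_1, max=delta)

    # advantages has shape (batch,); broadcast over the sequence dimension.
    per_token_loss1 = coef_1 * advantages.unsqueeze(1)
    per_token_loss2 = coef_2 * advantages.unsqueeze(1)
    # Negative sign because we minimize the loss (i.e., maximize the clipped objective).
    return -torch.min(per_token_loss1, per_token_loss2)
```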

The changes include:

  1. trl/trainer/grpo_config.py: Added the delta hyperparameter.
  2. trl/trainer/grpo_trainer.py: Modified _compute_loss to implement the two-sided clipping.
  3. tests/test_grpo_trainer.py: Added test_two_sided_clipping_loss to verify the new logic.

Fixes #3435

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section? (Assumed based on PR process)
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? (Docstring for delta in GRPOConfig was added.)
  • Did you write any new necessary tests? (The test_two_sided_clipping_loss test was added.)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ucalyptus (Contributor Author)

@qgallouedec

@kashif (Collaborator) commented May 12, 2025

@ucalyptus what are your thoughts on making this optional with regards to the default config values?

@ucalyptus (Contributor Author)

@kashif the current implementation already allows users to get the old behavior by setting delta=float('inf'), so the mechanism for optionality is there.

@kashif (Collaborator) commented May 12, 2025

What I mean is to have the default be inf and perhaps include a recommendation for good values in the help message; otherwise, code using the default values might change behaviour unexpectedly.

@qgallouedec (Member)

Agree with @kashif. Maybe even None for readability.

@ucalyptus (Contributor Author)

Ok so here are the changes I made:

  • GRPOConfig now defaults delta to None and has an updated docstring (see the sketch below).
  • GRPOTrainer._compute_loss now only applies the upper delta clipping when delta is explicitly set to a float value.
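A rough sketch of what the config side looks like under those changes; the dataclass is abbreviated and the help text is paraphrased, not copied from the PR:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GRPOConfig:  # abbreviated sketch; the real config has many more fields
    delta: Optional[float] = field(
        default=None,
        metadata={
            "help": "Upper clipping bound on the probability ratio for two-sided GRPO "
            "clipping. If None (default), the standard one-sided clipping is used. When "
            "set, a value greater than 1 + epsilon is recommended."
        },
    )
```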

Feel free to push back if there are further concerns.

@HuggingFaceDocBuilderDev

The docs for this PR live on a preview endpoint. All of your documentation changes will be reflected there; the docs are available until 30 days after the last update.

@kashif (Collaborator) commented May 12, 2025

@ucalyptus can you kindly run make precommit in the root dir of TRL to fix the formatting?

@kashif (Collaborator) commented May 13, 2025

@ucalyptus the failing CI test is due to a known issue on the dev branch, so it can be ignored.

@kashif (Collaborator) commented May 13, 2025

@ucalyptus do we want to add a warning that this config will be ignored when the Liger kernel loss is enabled?

@edbeeching (Collaborator)

> @ucalyptus do we want to add a warning that this config will be ignored when the Liger kernel loss is enabled?

Given that logs are already flooded with info/warnings from transformers and other libs, it might be better to explicitly raise a ValueError when Liger + delta clipping are enabled together, so the user can disable it and won't draw any wrong conclusions.
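Such a guard could look something like the sketch below; the flag name use_liger_loss is an assumption about how the Liger path is toggled and may not match the actual attribute in the trainer:

```python
# Hypothetical guard in the trainer's __init__; `use_liger_loss` is assumed here.
if args.use_liger_loss and args.delta is not None:
    raise ValueError(
        "The two-sided clipping parameter `delta` is not supported with the Liger GRPO loss. "
        "Please set `delta=None` or disable the Liger kernel loss."
    )
```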

@lewtun (Member) left a comment

Thank you very much for the quick implementation of this feature @ucalyptus !

The trainer logic looks sound to me, but the tests need to be refactored with parameterized so that we can maintain them and better catch regressions.

I also see quite a few unnecessary code comments, so please remove most of them, except in cases where some clarification is required.

@ucalyptus requested a review from @lewtun on May 13, 2025 at 15:59
@ucalyptus (Contributor Author)

@kashif thanks for helping with the PR, and thanks @lewtun for the comments. Let me know if there are any outstanding concerns.

@lewtun (Member) left a comment

Thanks for iterating @ucalyptus and @kashif! LGTM with some final nits.

@kashif merged commit 05bc43e into huggingface:main on May 13, 2025
9 of 10 checks passed
Successfully merging this pull request may close these issues: Implement Two-Sided Clipping for GRPO Trainer.
6 participants