feat: Implement Two-Sided Clipping for GRPO Trainer #3434
Conversation
@ucalyptus what are your thoughts on making this optional with regards to the default config values?
@kashif the current implementation already allows users to get the old behavior by setting delta=float('inf'), so the mechanism for optionality is there.
What I mean is to then have the default be
Agree with @kashif. Maybe even
Ok so here are the changes I made:
Feel free to push back if there are further concerns.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ucalyptus can you kindly run the
@ucalyptus the failing CI test is due to a known issue on the dev branch, so it can be ignored.
@ucalyptus do we want to add a warning that this config will be ignored when the Liger kernel loss is enabled?
Given that the logs are flooded with info/warnings from transformers and other libs, it might be better to explicitly raise a ValueError when Liger + delta clipping are enabled together, so the user can disable it and won't draw any wrong conclusions.
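A minimal sketch of such a guard, assuming the config exposes `use_liger_loss` and a `delta` that defaults to `None` (attribute names are illustrative, not necessarily what the final implementation uses):

```python
# Hypothetical guard in the trainer's __init__ (attribute names assumed):
if args.use_liger_loss and args.delta is not None:
    raise ValueError(
        "Two-sided clipping (`delta`) is not supported with the Liger GRPO loss. "
        "Please unset `delta` or disable `use_liger_loss`."
    )
```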
Thank you very much for the quick implementation of this feature @ucalyptus!
The trainer logic looks sound to me, but the tests need to be refactored to be parameterised so that we can maintain them and better catch regressions.
I also see quite a few unnecessary code comments, so please remove most of them, except in cases where some clarification is required.
Thanks for iterating @ucalyptus and @kashif! LGTM with some final nits.
As seen in Prime Intellect's new Intellect-2 Technical Report.
What does this PR do?
This PR introduces a two-sided clipping mechanism to the GRPO (Group Relative Policy Optimization) trainer. It addresses a potential stability issue in the standard GRPO formulation: for negative advantages ($\hat{A}_{i,t} < 0$), the original clipping only takes effect when the probability ratio is too small, so updates can become extremely large if the ratio grows very large.
The core modification changes the clipped per-token objective so that the probability ratio is additionally capped from above.
A new hyperparameter, `delta`, has been added to `GRPOConfig`. This parameter caps the probability ratio for negative advantages. It is recommended to set $\delta > 1+\epsilon$ to allow significant updates while preventing extreme changes that could destabilize training.

The changes include:
- `trl/trainer/grpo_config.py`: Added the `delta` hyperparameter.
- `trl/trainer/grpo_trainer.py`: Modified `_compute_loss` to implement the two-sided clipping (a rough sketch of this logic appears below).
- `tests/test_grpo_trainer.py`: Added `test_two_sided_clipping_loss` to verify the new logic.

Fixes #3435
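As a rough illustration of the clipping logic described above (a minimal sketch, not the exact `_compute_loss` code; the surrounding loss aggregation is omitted, and `delta=None` is assumed to disable the extra clamp):

```python
import torch

def two_sided_clipped_loss(per_token_logps, old_per_token_logps, advantages,
                           epsilon_low=0.2, epsilon_high=0.2, delta=None):
    # Per-token probability ratio between the current and old policy.
    coef_1 = torch.exp(per_token_logps - old_per_token_logps)
    # Standard PPO-style clipped ratio.
    coef_2 = torch.clamp(coef_1, 1 - epsilon_low, 1 + epsilon_high)
    if delta is not None:
        # Two-sided clipping: cap the unclipped ratio at delta. This only
        # changes the outcome for negative advantages, where the standard
        # clip leaves the ratio unbounded from above.
        coef_1 = torch.clamp(coef_1, max=delta)
    per_token_loss1 = coef_1 * advantages
    per_token_loss2 = coef_2 * advantages
    # Pessimistic (min) objective, negated so it can be minimised as a loss.
    return -torch.min(per_token_loss1, per_token_loss2)
```

With a setup like this, `delta=float('inf')` (or leaving it unset) recovers the original one-sided behaviour discussed in the conversation above.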
Before submitting
- Did you read the contributor guideline, Pull Request section? (Assumed based on PR process.)
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? (The `delta` parameter in `GRPOConfig` was added.)
- Did you write any new necessary tests? (The `test_two_sided_clipping_loss` test was added.)

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.