feat: Implement Two-Sided Clipping for GRPO Trainer #3434
Conversation
@ucalyptus what are your thoughts on making this optional with regards to the default config values?
@kashif the current implementation already allows users to get the old behavior by setting delta=float('inf'), so the mechanism for optionality is there.
What I mean is to then have the default be
Agree with @kashif. Maybe even
Ok so here are the changes I made:
Feel free to push back if there are further concerns.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ucalyptus can you kindly run the
@ucalyptus the failing CI test is due to a known issue on the dev branch, so it can be ignored.
@ucalyptus do we want to add a warning that this config will be ignored when the Liger kernel loss is enabled?
Given that the logs are flooded with info/warnings from transformers and other libs, it might be better to explicitly raise a ValueError when Liger + delta clipping are enabled together, so the user can disable it and won't draw any wrong conclusions.
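A minimal sketch of such a guard, assuming the config exposes `use_liger_loss` and a `delta` that defaults to `None` (attribute names are illustrative, not necessarily what the final implementation uses):

```python
# Hypothetical guard in the trainer's __init__ (attribute names assumed):
if args.use_liger_loss and args.delta is not None:
    raise ValueError(
        "Two-sided clipping (`delta`) is not supported with the Liger GRPO loss. "
        "Please unset `delta` or disable `use_liger_loss`."
    )
```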
Thank you very much for the quick implementation of this feature @ucalyptus!
The trainer logic looks sound to me, but the tests need to be refactored to be parameterised so that we can maintain them and better catch regressions.
I also see quite a few unnecessary code comments, so please remove most of them, except in cases where some clarification is required.
Thanks for iterating @ucalyptus and @kashif! LGTM with some final nits.
As seen in Prime Intellect's new Intellect-2 Technical Report.
What does this PR do?
This PR introduces a two-sided clipping mechanism to the GRPO (Group Relative Policy Optimization) trainer. It addresses a potential stability issue in the standard GRPO formulation: for negative advantages ($\hat{A}_{i,t} < 0$), the original clipping only takes effect when the probability ratio is too small, so updates can become extremely large if the ratio grows very large.
The core modification changes the clipped per-token objective so that the probability ratio is additionally capped from above.
A new hyperparameter, `delta`, has been added to `GRPOConfig`. This parameter caps the probability ratio for negative advantages. It is recommended to set $\delta > 1+\epsilon$ to allow significant updates while preventing extreme changes that could destabilize training.

The changes include:
- `trl/trainer/grpo_config.py`: Added the `delta` hyperparameter.
- `trl/trainer/grpo_trainer.py`: Modified `_compute_loss` to implement the two-sided clipping (a rough sketch of this logic appears below).
- `tests/test_grpo_trainer.py`: Added `test_two_sided_clipping_loss` to verify the new logic.

Fixes #3435
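As a rough illustration of the clipping logic described above (a minimal sketch, not the exact `_compute_loss` code; the surrounding loss aggregation is omitted, and `delta=None` is assumed to disable the extra clamp):

```python
import torch

def two_sided_clipped_loss(per_token_logps, old_per_token_logps, advantages,
                           epsilon_low=0.2, epsilon_high=0.2, delta=None):
    # Per-token probability ratio between the current and old policy.
    coef_1 = torch.exp(per_token_logps - old_per_token_logps)
    # Standard PPO-style clipped ratio.
    coef_2 = torch.clamp(coef_1, 1 - epsilon_low, 1 + epsilon_high)
    if delta is not None:
        # Two-sided clipping: cap the unclipped ratio at delta. This only
        # changes the outcome for negative advantages, where the standard
        # clip leaves the ratio unbounded from above.
        coef_1 = torch.clamp(coef_1, max=delta)
    per_token_loss1 = coef_1 * advantages
    per_token_loss2 = coef_2 * advantages
    # Pessimistic (min) objective, negated so it can be minimised as a loss.
    return -torch.min(per_token_loss1, per_token_loss2)
```

With a setup like this, `delta=float('inf')` (or leaving it unset) recovers the original one-sided behaviour discussed in the conversation above.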
Before submitting
- Did you read the contributor guideline, Pull Request section? (Assumed based on PR process.)
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? (The `delta` parameter in `GRPOConfig` was added.)
- Did you write any new necessary tests? (The `test_two_sided_clipping_loss` test was added.)

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.