What's the meaning of clip_ratio in GRPO Trainer? #3144

@I-l-l-I

I am trying to understand the rationale behind the calculation of clip_ratio in the following code snippet:

coef_1 = torch.exp(per_token_logps - old_per_token_logps)
coef_2 = torch.clamp(coef_1, 1 - self.epsilon_low, 1 + self.epsilon_high)
per_token_loss1 = coef_1 * advantages.unsqueeze(1)
per_token_loss2 = coef_2 * advantages.unsqueeze(1)
per_token_loss = -torch.min(per_token_loss1, per_token_loss2)

is_clipped = (per_token_loss1 < per_token_loss2).float()
clip_ratio = (is_clipped * completion_mask).sum() / completion_mask.sum()
self._metrics[mode]["clip_ratio"].append(self.accelerator.gather_for_metrics(clip_ratio).mean().item())

If clip_ratio is intended to measure how often clipping actually constrains the policy update, shouldn't is_clipped be:

is_clipped = (per_token_loss1 > per_token_loss2).float()

since we are using torch.min(per_token_loss1, per_token_loss2), which selects the clipped term only when per_token_loss2 is the smaller of the two?
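To make the question concrete, here is a minimal scalar sketch of the same arithmetic in plain Python. It is not the TRL code: the epsilon values (0.2 each) and the helper names `clamp` and `analyze` are assumptions chosen for illustration. It enumerates the sign and ratio cases and records which comparison coincides with the clipped term being the one that `min` selects.

```python
# Scalar re-enactment of the snippet above (illustrative, not TRL code).
# EPS_LOW / EPS_HIGH are assumed values standing in for epsilon_low / epsilon_high.
EPS_LOW, EPS_HIGH = 0.2, 0.2


def clamp(x, lo, hi):
    """Plain-Python stand-in for torch.clamp on a single value."""
    return max(lo, min(x, hi))


def analyze(ratio, advantage):
    """Return (loss1, loss2, clip_effective) for one token."""
    loss1 = ratio * advantage                                     # unclipped surrogate
    loss2 = clamp(ratio, 1 - EPS_LOW, 1 + EPS_HIGH) * advantage   # clipped surrogate
    # Clipping changes the loss only when min() picks the clipped term
    # and that term differs from the unclipped one.
    clip_effective = min(loss1, loss2) == loss2 and loss1 != loss2
    return loss1, loss2, clip_effective


# (ratio, advantage) pairs covering each combination of interest.
cases = {
    "ratio high, A > 0": (1.5, 1.0),
    "ratio low, A > 0": (0.5, 1.0),
    "ratio high, A < 0": (1.5, -1.0),
    "ratio low, A < 0": (0.5, -1.0),
    "ratio in range": (1.1, 1.0),
}

results = {}
for name, (ratio, adv) in cases.items():
    loss1, loss2, effective = analyze(ratio, adv)
    results[name] = {
        "loss1 > loss2": loss1 > loss2,
        "loss1 < loss2": loss1 < loss2,
        "clip effective": effective,
    }
    print(name, results[name])
```

In this sketch, `loss1 > loss2` is true in exactly the cases where the clipped term is selected (ratio above range with a positive advantage, ratio below range with a negative advantage), whereas `loss1 < loss2` is true in the cases where the ratio is out of range but `min` still picks the unclipped term.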

I would appreciate any insights or clarification on this matter. Thank you!
