Labels: ❓ question · 🏋 GRPO
I am trying to understand the rationale behind the calculation of `clip_ratio` in the following code snippet:
```python
# Per-token probability ratio between the current and old policy
coef_1 = torch.exp(per_token_logps - old_per_token_logps)
# Ratio clamped to [1 - eps_low, 1 + eps_high], as in the PPO clipped objective
coef_2 = torch.clamp(coef_1, 1 - self.epsilon_low, 1 + self.epsilon_high)
per_token_loss1 = coef_1 * advantages.unsqueeze(1)
per_token_loss2 = coef_2 * advantages.unsqueeze(1)
# Pessimistic (clipped) surrogate objective, negated to form a loss
per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
is_clipped = (per_token_loss1 < per_token_loss2).float()
# Fraction of non-masked completion tokens flagged as clipped
clip_ratio = (is_clipped * completion_mask).sum() / completion_mask.sum()
self._metrics[mode]["clip_ratio"].append(self.accelerator.gather_for_metrics(clip_ratio).mean().item())
```
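For context, the snippet appears to implement the standard PPO-style clipped surrogate per token (my reading of the code above, with $r_t$ the ratio `coef_1` and $\hat{A}_t$ the advantage):

$$
\mathcal{L}_t = -\min\!\Bigl(r_t \hat{A}_t,\ \operatorname{clip}\bigl(r_t,\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\bigr)\, \hat{A}_t\Bigr)
$$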
If `clip_ratio` is intended to indicate how frequently policy updates are constrained to prevent large changes, shouldn't `is_clipped` be

```python
is_clipped = (per_token_loss1 > per_token_loss2).float()
```

since the loss uses `torch.min(per_token_loss1, per_token_loss2)`? The `min` selects the clamped term `per_token_loss2` precisely when it is the smaller of the two, i.e. when `per_token_loss1 > per_token_loss2`; when `coef_1` lies inside the clip range the two terms are equal and no clipping occurs.
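To make the concern concrete, here is a minimal numeric sketch (hypothetical values, assuming a single token with positive advantage and `epsilon_low = epsilon_high = 0.2`). When `torch.min` selects the clamped term, `per_token_loss1 > per_token_loss2`, so the current `<` comparison leaves the token unflagged:

```python
import torch

# Hypothetical single-token example where clamping is active
advantage = torch.tensor([1.0])                   # positive advantage
coef_1 = torch.tensor([1.5])                      # ratio exp(logp - old_logp), above 1 + eps
coef_2 = torch.clamp(coef_1, 1 - 0.2, 1 + 0.2)    # -> 1.2, the clamp binds

per_token_loss1 = coef_1 * advantage              # 1.5
per_token_loss2 = coef_2 * advantage              # 1.2
per_token_loss = -torch.min(per_token_loss1, per_token_loss2)  # min picks the clamped 1.2

print((per_token_loss1 < per_token_loss2).item())  # False: current check does not flag this token
print((per_token_loss1 > per_token_loss2).item())  # True: the suggested check does
```

The same inequality direction holds for a negative advantage with the ratio below `1 - epsilon_low`: both surrogate terms flip sign, but the clamped term remains the smaller one.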
I would appreciate any insights or clarification on this matter. Thank you!