What are the reasons num_iterations (μ) defaults to 1 in the GRPO trainer? #3548
Unanswered
JenWei0312 asked this question in Q&A
Replies: 1 comment · 5 replies
-
Hi! I've been studying the GRPO implementation and noticed that num_iterations defaults to 1. From my understanding of the DeepSeek paper, setting μ > 1 is a key feature that allows multiple policy updates from a single generation batch, improving computational efficiency.

Could you help me understand the reasoning behind this default? I'm asking because many users tend to use default values, and if my understanding is correct, they might be missing out on GRPO's efficiency benefits.

Thanks for the great work on this trainer, and please let me know if I missed anything. 🙏
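For readers unfamiliar with μ: the toy sketch below (placeholder functions only, not the TRL implementation) shows the loop structure being discussed. Generation and scoring happen once per batch; μ controls how many optimization steps reuse that cached batch.

```python
# Toy sketch of the role of mu (num_iterations) in GRPO-style training.
# Every function here is a placeholder so the file runs end to end;
# only the loop structure is the point.
import random

def sample_completions(prompts, group_size):
    # Placeholder for the expensive step: sample a group of completions per prompt.
    return [[f"{p}-completion-{i}" for i in range(group_size)] for p in prompts]

def group_advantages(completions):
    # Placeholder for reward scoring plus group-relative advantage normalization.
    return [[random.gauss(0.0, 1.0) for _ in group] for group in completions]

def policy_update(step, advantages):
    # Placeholder for one clipped policy-gradient update on the cached batch.
    mean_abs = sum(abs(a) for g in advantages for a in g) / sum(len(g) for g in advantages)
    print(f"  update {step}: mean |A| = {mean_abs:.3f}")

def train(prompt_batches, mu=1, group_size=4):
    for prompts in prompt_batches:
        completions = sample_completions(prompts, group_size)  # generated once per batch
        advantages = group_advantages(completions)             # scored once per batch
        for step in range(mu):                                 # reused mu times
            policy_update(step, advantages)

train([["p1", "p2"]], mu=3)
```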
-
Great question: yes, μ=1 is the default in the DeepSeek Math paper and in the current implementation. It does make sense to increase it. Also, another way to improve efficiency is by increasing …
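Assuming a recent trl release where GRPOConfig exposes num_iterations, raising μ is a one-line config change. A minimal sketch (the model id, dataset, and reward function below are illustrative, not taken from this thread):

```python
# Sketch: raising mu (num_iterations) via GRPOConfig.
# Assumes a trl version that exposes `num_iterations`; the model,
# dataset, and reward function are illustrative placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer shorter completions (placeholder reward function).
    return [-float(len(c)) for c in completions]

config = GRPOConfig(
    output_dir="grpo-mu4",   # illustrative output directory
    num_generations=8,       # completions sampled per prompt (group size G)
    num_iterations=4,        # mu: optimization steps reusing each generation batch (default 1)
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # illustrative small model
    reward_funcs=reward_len,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```

With μ > 1 the same generations are reused for several clipped updates, so the per-step generation cost is amortized roughly by a factor of μ.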