
Conversation

@jameslovespancakes

Fixes #41898

When drop_last=False (the default), the last batch may contain fewer samples than per_device_eval_batch_size. Repeating the scalar loss a fixed batch_size number of times therefore over-represents the last batch in the final average loss.
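
A quick way to see the skew, as a standalone illustration with made-up numbers (not code from this PR):

```python
import torch

# Toy numbers: three eval batches with per-batch mean losses; the last
# batch holds only 3 of the configured 8 samples (drop_last=False).
batch_sizes = [8, 8, 3]
batch_losses = torch.tensor([0.50, 0.50, 2.00])

# Buggy: every scalar loss is repeated `fixed` times, so the 3-sample
# batch is weighted as if it contained 8 samples.
fixed = 8
buggy = torch.cat([loss.repeat(fixed) for loss in batch_losses]).mean()

# Fixed: repeat by the observed batch size, weighting each batch by the
# number of samples it actually contains.
exact = torch.cat(
    [loss.repeat(n) for loss, n in zip(batch_losses, batch_sizes)]
).mean()

print(f"fixed-size repeat: {buggy:.4f}")  # 1.0000 (short batch over-weighted)
print(f"observed repeat:   {exact:.4f}")  # 0.7368 = (8*0.5 + 8*0.5 + 3*2.0) / 19
```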

Changes:

  • Trainer: Use observed_batch_size instead of fixed batch_size when repeating eval loss for gather_for_metrics
  • no_trainer examples: Use actual batch size from input_ids.shape[0] for both eval and train loss computation
  • Train loss: Weight by actual batch size and divide by total samples instead of number of batches

This yields an accurate loss regardless of batch-size variability while remaining backward compatible: behavior is identical when all batches have the same size.
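
For reference, the corrected eval accumulation pattern looks roughly like this. This is a sketch only: `model`, `eval_dataloader`, and `accelerator` stand in for the objects the actual scripts already define, and the exact diff may differ.

```python
import torch

total_loss, total_samples = 0.0, 0
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    # Observed size of this batch; the last one may be smaller than
    # per_device_eval_batch_size when drop_last=False.
    observed_batch_size = batch["input_ids"].shape[0]
    # Repeat by the observed size so gather_for_metrics collects one
    # loss value per sample rather than per configured batch slot.
    losses = accelerator.gather_for_metrics(
        outputs.loss.repeat(observed_batch_size)
    )
    total_loss += losses.sum().item()
    total_samples += losses.numel()

eval_loss = total_loss / total_samples  # mean over samples, not batches
```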

@Rocketknight1
Member

The updates to the no_trainer examples look okay, but I'd like @SunMarc's confirmation about the change in trainer.py!
