Commit 8e1f6e6

[Distributed]Add loss nan/inf checker (#8943)
* add checker
* fix fp16
1 parent 1ea1f50 commit 8e1f6e6

1 file changed: +4 -0 lines changed

paddlenlp/trainer/trainer.py

Lines changed: 4 additions & 0 deletions
@@ -992,6 +992,10 @@ def _inner_training_loop(
 else:
     tr_loss_step = self.training_step(model, inputs)
 
+if not args.fp16:
+    if not paddle.isfinite(tr_loss_step).all().item():
+        raise ValueError(f"Loss contains inf or nan values at rank {paddle.distributed.get_rank()}")
+
 tr_loss += tr_loss_step
 
 def fused_allreduce_gradients_no_sync(paramlist, hcg):
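
For readers who want to try the behavior of the added guard without a Paddle install, here is a minimal standalone sketch. It is not the trainer's code: `check_loss_finite` and its `loss_values`, `fp16`, and `rank` parameters are hypothetical names, and the standard library's `math.isfinite` stands in for `paddle.isfinite(...).all().item()`. The commit message does not say why fp16 runs are excluded; one plausible reason, stated here as an assumption, is that fp16 training with dynamic loss scaling can produce occasional non-finite steps that the scaler is expected to skip.

```python
import math
from typing import Sequence


def check_loss_finite(loss_values: Sequence[float], fp16: bool, rank: int = 0) -> None:
    """Raise ValueError if any loss value is inf or nan, unless fp16 is enabled.

    Hypothetical helper: mirrors the structure of the guard in the diff,
    but operates on plain Python floats instead of a Paddle tensor.
    """
    if fp16:
        # Same shape as `if not args.fp16:` in the diff: skip the check for fp16 runs.
        return
    # Stand-in for `paddle.isfinite(tr_loss_step).all().item()`.
    if not all(math.isfinite(v) for v in loss_values):
        raise ValueError(f"Loss contains inf or nan values at rank {rank}")


if __name__ == "__main__":
    check_loss_finite([0.5, 1.25], fp16=False)      # finite loss: no error
    check_loss_finite([float("nan")], fp16=True)    # fp16: check is skipped
    try:
        check_loss_finite([0.5, float("inf")], fp16=False, rank=3)
    except ValueError as err:
        print(err)
```

Raising immediately, with the rank in the message, stops the run at the first bad step and points at the process that produced it, rather than letting a non-finite value accumulate into `tr_loss`.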
