Description
I followed the steps in the README, configured the directory structure as described, and trained the model, but I keep running into a strange problem. An excerpt of the log is shown below.
----OUTPUT----
```
Epoch: [5] [1660/2241] eta: 0:08:56 lr: 0.003000 loss: 2.2882 (2.4257) loss_proposal_cls: 0.0818 (0.0915) loss_proposal_reg: 1.2728 (1.4000) loss_box_cls: 0.1167 (0.1311) loss_box_reg: 0.1667 (0.1707) loss_box_reid: 0.4618 (0.5611) loss_rpn_reg: 0.0283 (0.0344) loss_rpn_cls: 0.0317 (0.0369) time: 0.9248 data: 0.0005 max mem: 24005
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.0837, device='cuda:0', grad_fn=), 'loss_proposal_reg': tensor(1.3923, device='cuda:0', grad_fn=), 'loss_box_cls': tensor(0.1187, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.1719, device='cuda:0', grad_fn=), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_reg': tensor(0.0457, device='cuda:0', grad_fn=), 'loss_rpn_cls': tensor(0.0226, device='cuda:0', grad_fn=)}
```
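To narrow down where the NaN first appears, I have been using a debugging wrapper along these lines (my own sketch, not code from this repository; `model`, `optimizer`, and `data_loader` are placeholders for whatever the training script actually passes around):

```python
import math
import torch

# Hypothetical debug version of the training step; names are placeholders.
def train_one_epoch_debug(model, optimizer, data_loader, device):
    # Anomaly detection makes autograd report the operation that produced
    # the NaN in the backward pass (slow, so only enable while debugging).
    with torch.autograd.detect_anomaly():
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            # Assumes the model returns a dict of loss terms in training mode,
            # as the log output above suggests.
            loss_dict = model(images, targets)

            # Report exactly which terms are non-finite before stopping.
            bad = {k: v.item() for k, v in loss_dict.items()
                   if not math.isfinite(v.item())}
            if bad:
                print("Non-finite loss terms:", bad)
                raise RuntimeError("Loss is nan, stopping training")

            losses = sum(loss_dict.values())
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
```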
This happens after a fixed number of epochs, and the "Loss is nan, stopping training" error is very regular: for example, after 5 epochs it appears around the 1160th batch of the 6th epoch, regardless of whether training starts from epoch 0 or is continued with the --resume flag.
The error occurs and stops training whether the model is trained on an RTX A6000, an RTX A5000, or a Tesla V100 32G, and whether or not the batch size and learning rate are scaled proportionally.
I used the --resume flag to train for 20 epochs, and every time the problem appeared it was loss_box_reid that became NaN.
This looks like a bug in the code, but I'm not sure what causes it or how to fix it.
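In case it helps anyone reproduce or work around this while the root cause is unclear, this is a rough sketch of the two mitigations I am experimenting with: skipping batches whose loss is already non-finite instead of aborting the run, and clipping gradients before the optimizer step (the function name and the max_norm value are mine, not from the repository):

```python
import math
import torch

# Hypothetical guarded optimizer step for the training loop; names are placeholders.
def optimizer_step_with_guards(model, optimizer, loss_dict, max_norm=10.0):
    losses = sum(loss_dict.values())

    # Skip the update entirely if any term has already blown up,
    # rather than stopping the whole training run.
    if not math.isfinite(losses.item()):
        print("Skipping batch with non-finite loss:",
              {k: v.item() for k, v in loss_dict.items()})
        optimizer.zero_grad()
        return

    optimizer.zero_grad()
    losses.backward()
    # Clip gradients so a single bad batch cannot push the re-id head into NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```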