
Unable to resume training when using DDP #851

@bsugerman

Description


I'm running training on 2 GPUs without any problems as follows:

 python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml

However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume 

and a number of variants, leaving out different multi-process-related arguments. The training gets to:

Transferred 370/370 items from ./runs/exp1/weights/last.pt
Using DDP

sits there for a few seconds, then prints a second

Using DDP

and then it just hangs. On the GPUs, 3 processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.
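In case it matters, the next variant I would expect to try points --resume directly at the checkpoint shown in the log above (I'm assuming --resume also accepts an explicit path to last.pt in this version, not just the bare flag):

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume ./runs/exp1/weights/last.pt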

Any ideas?
