
Unable to resume training when using DDP #851

@bsugerman

Description


I'm running training on 2 GPUs without any problems as follows:

 python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml

However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume 

and a number of variants, leaving out different multi-process-related arguments. The training gets to:

Transferred 370/370 items from ./runs/exp1/weights/last.pt
Using DDP

sits there for a few seconds, then prints a second

Using DDP

and then it just hangs. On the GPUs, 3 processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.
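In case it matters, the next variant I would expect to try points --resume directly at the checkpoint shown in the log above (I'm assuming --resume also accepts an explicit path to last.pt in this version, not just the bare flag):

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume ./runs/exp1/weights/last.pt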

Any ideas?
