I'm running training on 2 GPUs without any problems as follows:
python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111 train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml
However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried
python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111 train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume
and a number of variants, leaving out different arguments related to multi-GPU training. The training gets to:
Transferred 370/370 items from ./runs/exp1/weights/last.pt
Using DDP
sits for a few seconds, then prints a second
Using DDP
and then just hangs. On the GPUs, three processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.
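One thing I haven't ruled out (this is just my guess, not something I've confirmed) is that a process from the killed run is still alive and holding GPU memory or the --master_port, which might explain the extra process on GPU 0 and the hang during DDP setup. A quick check I could run before relaunching, as a minimal sketch assuming port 1111 from the commands above:

import socket

# Sketch (my assumption, not confirmed): check whether the port passed to
# --master_port is actually free before relaunching the DDP job. If bind()
# fails, a stale process from the killed run is probably still holding it.
def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if __name__ == "__main__":
    print("port 1111 free:", port_is_free(1111))

If the port turns out to be in use, killing the leftover train.py processes (or switching to a different --master_port) would be the first thing I'd try.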
Any ideas?