Closed
Labels
Stale (Stale and schedule for closing soon), bug (Something isn't working)
Description
Search before asking
- I have searched the YOLOv5 issues and found no similar bug report.
YOLOv5 Component
Training, Multi-GPU
Bug
I was trying to train on my own dataset using multi-GPU training with the following command:
python -m torch.distributed.launch --nproc_per_node 4 train.py --weights yolov5s.pt --data my_dataset/my_dataset.yaml --epochs 300 --batch-size 256 --workers 32 --cache disk --patience 100 --device 0,1,2,3
And I am getting this error:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "train.py", line 636, in <module>
main(opt)
File "train.py", line 521, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested' # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Traceback (most recent call last):
File "train.py", line 636, in <module>
main(opt)
File "train.py", line 521, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested' # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Traceback (most recent call last):
File "train.py", line 636, in <module>
main(opt)
File "train.py", line 521, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested' # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
wandb: Currently logged in as: hdnh2006 (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=***/***.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=256, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=disk, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=32, project=https://wandb.ai/**/**, name=**, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
File "train.py", line 636, in <module>
main(opt)
File "train.py", line 521, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested' # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Killing subprocess 72598
Killing subprocess 72599
Killing subprocess 72600
Killing subprocess 72601
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
Clearly it is treating the device argument 0,1,2,3 as a single string rather than as separate device indices, as it did before, and it is calling int('0,1,2,3'), which doesn't make any sense.
Please could you help me with this?
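To make the failure easier to see outside of training, here is a minimal sketch that reproduces the `ValueError` from the traceback and shows one possible way to validate a comma-separated device string. This is not the actual YOLOv5 code or fix: `parse_device_ids` and `check_devices` are hypothetical helpers, and the `available` argument stands in for `torch.cuda.device_count()` so the snippet runs without PyTorch.

```python
from typing import List


def parse_device_ids(device: str) -> List[int]:
    """Hypothetical helper: turn a string like '0,1,2,3' into [0, 1, 2, 3]."""
    return [int(part) for part in device.split(",") if part.strip()]


def check_devices(device: str, available: int) -> List[int]:
    """Check every requested index against the number of visible GPUs.

    `available` is an assumption standing in for torch.cuda.device_count().
    """
    ids = parse_device_ids(device)
    assert ids and all(0 <= i < available for i in ids), \
        f"invalid CUDA device {device} requested"
    return ids


# What the failing line effectively does with a multi-GPU string:
try:
    int("0,1,2,3")
except ValueError as err:
    print(f"reproduced: {err}")

# Splitting on commas first avoids the error:
print(check_devices("0,1,2,3", available=4))
```

With four visible GPUs, the second call accepts the same `--device 0,1,2,3` string that currently crashes, while still rejecting an index that is out of range.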
Environment
- YOLOv5: latest commit
- Ubuntu
- Python 3.7
EDIT (IMPORTANT): Reverting to the previous commit (af00134) solves this issue, so please take a look at the commit you made a few hours ago.