Multi-GPU training is failing #6297

@hdnh2006

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Multi-GPU

Bug

I was trying to train on my own dataset with multi-GPU training, using the following command:

python -m torch.distributed.launch --nproc_per_node 4 train.py --weights yolov5s.pt --data my_dataset/my_dataset.yaml --epochs 300 --batch-size 256 --workers 32 --cache disk --patience 100 --device 0,1,2,3

And I am getting this error:


*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "train.py", line 636, in <module>
    main(opt)
  File "train.py", line 521, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
    assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested'  # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Traceback (most recent call last):
  File "train.py", line 636, in <module>
    main(opt)
  File "train.py", line 521, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
    assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested'  # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Traceback (most recent call last):
  File "train.py", line 636, in <module>
    main(opt)
  File "train.py", line 521, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
    assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested'  # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
wandb: Currently logged in as: hdnh2006 (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=***/***.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=256, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=disk, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=32, project=https://wandb.ai/**/**, name=**, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "train.py", line 636, in <module>
    main(opt)
  File "train.py", line 521, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
 File "/home/ubuntu/yolov5/yolov5/utils/torch_utils.py", line 65, in select_device
    assert torch.cuda.device_count() > int(device), f'invalid CUDA device {device} requested'  # check index
ValueError: invalid literal for int() with base 10: '0,1,2,3'
Killing subprocess 72598
Killing subprocess 72599
Killing subprocess 72600
Killing subprocess 72601
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/ubuntu/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

Clearly, it is treating the device argument 0,1,2,3 as a single value rather than as a list of devices as before, and it is calling int('0,1,2,3'), which doesn't make any sense.

Could you please help me with this?
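
For reference, the failure in the traceback can be reproduced without YOLOv5 or a GPU, since it comes down to calling int() on the whole comma-separated string. The snippet below is only a minimal sketch of a per-index check, not the actual YOLOv5 fix: the helper names parse_device_ids and check_device_ids are hypothetical, and device_count is an assumed stand-in for torch.cuda.device_count().

```python
from typing import List


def parse_device_ids(device: str) -> List[int]:
    """Split a comma-separated device string such as '0,1,2,3' into integer indices."""
    return [int(part) for part in device.split(",") if part.strip()]


def check_device_ids(device: str, device_count: int) -> List[int]:
    """Validate every requested index against the number of visible GPUs.

    device_count is an assumed stand-in for torch.cuda.device_count(),
    so this sketch runs on a machine without CUDA.
    """
    ids = parse_device_ids(device)
    for idx in ids:
        if not 0 <= idx < device_count:
            raise ValueError(f"invalid CUDA device {device} requested")
    return ids


# Reproduce the error from the traceback: int() cannot parse the whole string.
try:
    int("0,1,2,3")
except ValueError as err:
    message = str(err)

# Checking each index separately accepts the same argument.
ids = check_device_ids("0,1,2,3", device_count=4)
```

With four visible GPUs, the per-index check accepts --device 0,1,2,3, while a request such as 0,4 would still be rejected because index 4 is out of range.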

Environment

  • YOLOv5: latest commit
  • Ubuntu
  • Python 3.7

IMPORTANT EDIT: Reverting to the previous commit (af00134) solves this issue, so please take a look at the commit you made a few hours ago.

Labels: Stale, bug
