Facing Issue while running it on Multi-GPU DDP #3364

@karndeepsingh

❔Question

I have been trying to run training on multiple GPUs, but it gets stuck as soon as the first epoch starts. Any suggestions for resolving this would be helpful. The full output of the run is attached below. Please help.

Additional context


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


github: skipping check (offline)
YOLOv5 🚀 v5.0-115-g407dc50 torch 1.8.1+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019.4375MB)
CUDA:1 (NVIDIA GeForce RTX 2080 Ti, 11014.375MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Namespace(adam=False, artifact_alias='latest', batch_size=1, bbox_interval=-1, bucket='', cache_images=False, cfg='', data='./data/coco128.yaml', device='0,1', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=0, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=0, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp28', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=2, upload_dataset=False, weights='yolov5x6.pt', workers=8, world_size=2)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)
Overriding model.yaml nc=80 with nc=9

                 from  n    params  module                                  arguments
  0                -1  1      8800  models.common.Focus                     [3, 80, 3]
  1                -1  1    115520  models.common.Conv                      [80, 160, 3, 2]
  2                -1  1    309120  models.common.C3                        [160, 160, 4]
  3                -1  1    461440  models.common.Conv                      [160, 320, 3, 2]
  4                -1  1   3285760  models.common.C3                        [320, 320, 12]
  5                -1  1   1844480  models.common.Conv                      [320, 640, 3, 2]
  6                -1  1  13125120  models.common.C3                        [640, 640, 12]
  7                -1  1   5531520  models.common.Conv                      [640, 960, 3, 2]
  8                -1  1  11070720  models.common.C3                        [960, 960, 4]
  9                -1  1  11061760  models.common.Conv                      [960, 1280, 3, 2]
 10                -1  1   4099840  models.common.SPP                       [1280, 1280, [3, 5, 7]]
 11                -1  1  19676160  models.common.C3                        [1280, 1280, 4, False]
 12                -1  1   1230720  models.common.Conv                      [1280, 960, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 8]  1         0  models.common.Concat                    [1]
 15                -1  1  11992320  models.common.C3                        [1920, 960, 4, False]
 16                -1  1    615680  models.common.Conv                      [960, 640, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 6]  1         0  models.common.Concat                    [1]
 19                -1  1   5332480  models.common.C3                        [1280, 640, 4, False]
 20                -1  1    205440  models.common.Conv                      [640, 320, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 4]  1         0  models.common.Concat                    [1]
 23                -1  1   1335040  models.common.C3                        [640, 320, 4, False]
 24                -1  1    922240  models.common.Conv                      [320, 320, 3, 2]
 25          [-1, 20]  1         0  models.common.Concat                    [1]
 26                -1  1   4922880  models.common.C3                        [640, 640, 4, False]
 27                -1  1   3687680  models.common.Conv                      [640, 640, 3, 2]
 28          [-1, 16]  1         0  models.common.Concat                    [1]
 29                -1  1  11377920  models.common.C3                        [1280, 960, 4, False]
 30                -1  1   8296320  models.common.Conv                      [960, 960, 3, 2]
 31          [-1, 12]  1         0  models.common.Concat                    [1]
 32                -1  1  20495360  models.common.C3                        [1920, 1280, 4, False]
 33  [23, 26, 29, 32]  1    134568  models.yolo.Detect                      [9, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [320, 640, 960, 1280]]
Model Summary: 773 layers, 141138888 parameters, 141138888 gradients, 221.6 GFLOPS

Transferred 1004/1012 items from yolov5x6.pt
Scaled weight_decay = 0.0005
Optimizer groups: 171 .bias, 171 conv.weight, 167 other
train: Scanning '/home/gsatis/Documents/karndeep/objectdetection/train/labels' i
train: New cache created: /home/gsatis/Documents/karndeep/objectdetection/train/labels.cache
val: Scanning '/home/gsatis/Documents/karndeep/objectdetection/test/labels' imag
val: New cache created: /home/gsatis/Documents/karndeep/objectdetection/test/labels.cache
Plotting labels...
train: Scanning '/home/gsatis/Documents/karndeep/objectdetection/train/labels.ca

autoanchor: Analyzing anchors... anchors/target = 3.12, Best Possible Recall (BPR) = 0.9358. Attempting to improve anchors, please wait...
autoanchor: Running kmeans for 12 anchors on 592 points...
autoanchor: thr=0.25: 1.0000 best possible recall, 6.91 anchors past thr
autoanchor: n=12, img_size=640, metric_all=0.354/0.805-mean/best, past_thr=0.513-mean: 66,14, 106,16, 152,21, 216,36, 142,57, 367,30, 290,47, 254,105, 338,114, 251,165, 335,180, 311,246
autoanchor: Evolving anchors with Genetic Algorithm: fitness = 0.8152: 100%|█| 1
autoanchor: thr=0.25: 1.0000 best possible recall, 7.08 anchors past thr
autoanchor: n=12, img_size=640, metric_all=0.361/0.815-mean/best, past_thr=0.515-mean: 69,13, 110,15, 137,25, 148,45, 203,34, 369,28, 269,52, 278,93, 252,145, 334,114, 325,171, 302,227
autoanchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.

Image sizes 640 train, 640 test
Using 0 dataloader workers
Logging results to runs/train/exp28
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 0/299     8.79G    0.1024   0.02105   0.06316    0.1866         2       640
Reducer buckets have been rebuilt in this iteration.

/home/gsatis/anaconda3/envs/yolov5_3.6/lib/python3.7/site-packages/torch/jit/_trace.py:728: UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.
"The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is."
 0/299     8.98G   0.08874   0.02311   0.05468    0.1665         2       640
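
A note for reading the log above: the Namespace line reports `batch_size=1`, `total_batch_size=2`, `world_size=2`, and `workers=8`, yet the run prints "Using 0 dataloader workers". The sketch below shows how those numbers can fit together. It assumes, from memory of YOLOv5 v5.0, that the per-process batch is the total batch divided by the world size and that the worker count is capped by the per-process batch; check both against the `train.py` and `utils/datasets.py` in your checkout. The function names and the CPU count of 16 are hypothetical. It only explains the logged values and is not a diagnosis of the hang.

```python
def per_process_batch_size(total_batch_size: int, world_size: int) -> int:
    """Split the total batch evenly across DDP processes (one per GPU)."""
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be a multiple of the number of GPUs")
    return total_batch_size // world_size


def dataloader_workers(cpu_count: int, world_size: int,
                       batch_size: int, requested_workers: int) -> int:
    """Assumed worker-count rule: a per-process batch of 1 yields 0 workers."""
    return min(cpu_count // world_size,
               batch_size if batch_size > 1 else 0,
               requested_workers)


if __name__ == "__main__":
    # Values taken from the Namespace line in the log, except cpu_count,
    # which is a made-up placeholder (the log does not report it).
    batch = per_process_batch_size(total_batch_size=2, world_size=2)
    workers = dataloader_workers(cpu_count=16, world_size=2,
                                 batch_size=batch, requested_workers=8)
    print(f"per-process batch size: {batch}")
    print(f"dataloader workers:     {workers}")
```

Under these assumptions, a total batch of 2 on two GPUs gives each process a batch of 1, which would account for the zero-worker line despite `workers=8`.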
