Facing Issue while running it on Multi-GPU DDP

## ❔Question
I have been trying to run the code on multiple GPUs, but training gets stuck as soon as the first epoch starts. Any suggestions for resolving this would be helpful. The full output of the run is attached below. Please help.

## Additional context
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
github: skipping check (offline)
YOLOv5 🚀 v5.0-115-g407dc50 torch 1.8.1+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019.4375MB)
                                             CUDA:1 (NVIDIA GeForce RTX 2080 Ti, 11014.375MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Namespace(adam=False, artifact_alias='latest', batch_size=1, bbox_interval=-1, bucket='', cache_images=False, cfg='', data='./data/coco128.yaml', device='0,1', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=0, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=0, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp28', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=2, upload_dataset=False, weights='yolov5x6.pt', workers=8, world_size=2)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)
Overriding model.yaml nc=80 with nc=9

                 from  n    params  module                                  arguments                     
  0                -1  1      8800  models.common.Focus                     [3, 80, 3]                    
  1                -1  1    115520  models.common.Conv                      [80, 160, 3, 2]               
  2                -1  1    309120  models.common.C3                        [160, 160, 4]                 
  3                -1  1    461440  models.common.Conv                      [160, 320, 3, 2]              
  4                -1  1   3285760  models.common.C3                        [320, 320, 12]                
  5                -1  1   1844480  models.common.Conv                      [320, 640, 3, 2]              
  6                -1  1  13125120  models.common.C3                        [640, 640, 12]                
  7                -1  1   5531520  models.common.Conv                      [640, 960, 3, 2]              
  8                -1  1  11070720  models.common.C3                        [960, 960, 4]                 
  9                -1  1  11061760  models.common.Conv                      [960, 1280, 3, 2]             
 10                -1  1   4099840  models.common.SPP                       [1280, 1280, [3, 5, 7]]       
 11                -1  1  19676160  models.common.C3                        [1280, 1280, 4, False]        
 12                -1  1   1230720  models.common.Conv                      [1280, 960, 1, 1]             
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 14           [-1, 8]  1         0  models.common.Concat                    [1]                           
 15                -1  1  11992320  models.common.C3                        [1920, 960, 4, False]         
 16                -1  1    615680  models.common.Conv                      [960, 640, 1, 1]              
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 18           [-1, 6]  1         0  models.common.Concat                    [1]                           
 19                -1  1   5332480  models.common.C3                        [1280, 640, 4, False]         
 20                -1  1    205440  models.common.Conv                      [640, 320, 1, 1]              
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 22           [-1, 4]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1335040  models.common.C3                        [640, 320, 4, False]          
 24                -1  1    922240  models.common.Conv                      [320, 320, 3, 2]              
 25          [-1, 20]  1         0  models.common.Concat                    [1]                           
 26                -1  1   4922880  models.common.C3                        [640, 640, 4, False]          
 27                -1  1   3687680  models.common.Conv                      [640, 640, 3, 2]              
 28          [-1, 16]  1         0  models.common.Concat                    [1]                           
 29                -1  1  11377920  models.common.C3                        [1280, 960, 4, False]         
 30                -1  1   8296320  models.common.Conv                      [960, 960, 3, 2]              
 31          [-1, 12]  1         0  models.common.Concat                    [1]                           
 32                -1  1  20495360  models.common.C3                        [1920, 1280, 4, False]        
 33  [23, 26, 29, 32]  1    134568  models.yolo.Detect                      [9, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [320, 640, 960, 1280]]
Model Summary: 773 layers, 141138888 parameters, 141138888 gradients, 221.6 GFLOPS

Transferred 1004/1012 items from yolov5x6.pt
Scaled weight_decay = 0.0005
Optimizer groups: 171 .bias, 171 conv.weight, 167 other
train: Scanning '/home/gsatis/Documents/karndeep/objectdetection/train/labels' i
train: New cache created: /home/gsatis/Documents/karndeep/objectdetection/train/labels.cache
val: Scanning '/home/gsatis/Documents/karndeep/objectdetection/test/labels' imag
val: New cache created: /home/gsatis/Documents/karndeep/objectdetection/test/labels.cache
Plotting labels... 
train: Scanning '/home/gsatis/Documents/karndeep/objectdetection/train/labels.ca

autoanchor: Analyzing anchors... anchors/target = 3.12, Best Possible Recall (BPR) = 0.9358. Attempting to improve anchors, please wait...
autoanchor: Running kmeans for 12 anchors on 592 points...
autoanchor: thr=0.25: 1.0000 best possible recall, 6.91 anchors past thr
autoanchor: n=12, img_size=640, metric_all=0.354/0.805-mean/best, past_thr=0.513-mean: 66,14,  106,16,  152,21,  216,36,  142,57,  367,30,  290,47,  254,105,  338,114,  251,165,  335,180,  311,246
autoanchor: Evolving anchors with Genetic Algorithm: fitness = 0.8152: 100%|█| 1
autoanchor: thr=0.25: 1.0000 best possible recall, 7.08 anchors past thr
autoanchor: n=12, img_size=640, metric_all=0.361/0.815-mean/best, past_thr=0.515-mean: 69,13,  110,15,  137,25,  148,45,  203,34,  369,28,  269,52,  278,93,  252,145,  334,114,  325,171,  302,227
autoanchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.

Image sizes 640 train, 640 test
Using 0 dataloader workers
Logging results to runs/train/exp28
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     0/299     8.79G    0.1024   0.02105   0.06316    0.1866         2       640Reducer buckets have been rebuilt in this iteration.
/home/gsatis/anaconda3/envs/yolov5_3.6/lib/python3.7/site-packages/torch/jit/_trace.py:728: UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.
  "The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is."
     0/299     8.98G   0.08874   0.02311   0.05468    0.1665         2       640
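
To make the settings in the log above easier to read, here is a minimal sketch of how `total_batch_size=2`, `world_size=2`, `batch_size=1`, and `Using 0 dataloader workers` could relate to each other. It assumes that the total batch is split evenly across processes and that the worker count is capped by the per-process batch size. Both function names are hypothetical, and this is not claimed to be the actual YOLOv5 code.

```python
import os


def per_process_batch_size(total_batch_size: int, world_size: int) -> int:
    """Split the total batch evenly across DDP processes (assumed rule)."""
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be a multiple of world size")
    return total_batch_size // world_size


def dataloader_workers(batch_size: int, requested_workers: int,
                       world_size: int, cpu_count=None) -> int:
    """Cap the worker count (assumed rule): a per-process batch of 1 gives 0 workers."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return min(cpus // world_size,
               batch_size if batch_size > 1 else 0,
               requested_workers)


# Values taken from the Namespace(...) line in the log; cpu_count=16 is a made-up example.
batch = per_process_batch_size(total_batch_size=2, world_size=2)
workers = dataloader_workers(batch, requested_workers=8, world_size=2, cpu_count=16)
print(batch, workers)  # prints "1 0"
```

Under these assumptions, a per-GPU batch size of 1 would explain why the run reports 0 dataloader workers even though `workers=8` was requested.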

