
some bugs when training #1547

@wuzuiyuzui

Description


Hello, I ran into a difficult problem when using YOLOv5, and reinstalling the system did not help. I am very confused about what is happening, and it has been bothering me for several days. I closed my previous issue and am opening this one with the detailed error output; can you give me some help? I can run test and detect, but I cannot train.

🐛 Bug

Training crashes with a cuDNN error on the first backward pass; test and detect work fine.

To Reproduce (REQUIRED)

Input:

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)

Output:
Using torch 1.7.0+cu101 CUDA:0 (GeForce RTX 2080 Ti, 10997MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=7

             from  n    params  module                                  arguments                     

0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 32364 models.yolo.Detect [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7271276 parameters, 7271276 gradients

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'coco128/labels/train2017.cache' for images and labels... 3219 found, 0 missing, 20 empty, 0 corrupted: 100%|██████████| 3219/3219 [00:00<?, ?it/s]
Scanning 'coco128/labels/val.cache' for images and labels... 246 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 248/248 [00:00<?, ?it/s]

Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9894
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size

0%| | 0/202 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ljy/yolov5-master/train.py", line 492, in
train(hyp, opt, device, tb_writer, wandb)
File "/home/ljy/yolov5-master/train.py", line 293, in train
scaler.scale(loss).backward()
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 256, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f8ba4002b60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
output: TensorDescriptor 0x7f8ba40033a0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
weight: FilterDescriptor 0x7f8ba403e080
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0x7f8a73b60000
output: 0x7f8c792e0000
weight: 0x7f8d5b660000

Process finished with exit code 1
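
To narrow this down, here is a small check I intend to run. It is my own sketch based on the PyTorch-generated snippet above (not part of PyTorch's output): it reruns the same failing 256→256 3×3 convolution with cuDNN autotuning toggled on/off and in both half and float32 precision, to see whether the crash is tied to `cudnn.benchmark` or to FP16 kernels specifically.

```python
import torch

def try_conv(dtype, benchmark):
    # Mirror the failing layer: 256->256 3x3 conv on a 16x256x20x20 batch.
    torch.backends.cudnn.benchmark = benchmark
    data = torch.randn(16, 256, 20, 20, dtype=dtype, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().to(dtype)
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()

for dtype in (torch.half, torch.float32):
    for benchmark in (False, True):
        try:
            try_conv(dtype, benchmark)
            print(f'OK    dtype={dtype}, cudnn.benchmark={benchmark}')
        except RuntimeError as err:
            print(f'ERROR dtype={dtype}, cudnn.benchmark={benchmark}: {err}')
```

If only the half-precision / benchmark=True combination fails, that would point at the cuDNN algorithm selection for FP16 rather than at the training code itself.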

Environment

  • OS: Ubuntu 20.04
  • GPU: GeForce RTX 2080 Ti
  • torch: 1.7.0+cu101
  • CUDA: 10.1
  • cuDNN: 7.6.4
  • NVIDIA driver: 440.95

  [Screenshots of the environment and error, taken 2020-11-28 16:06–16:09]
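
To double-check the versions above, the following sketch (my own addition; it only reads version info through standard torch APIs) prints what the installed torch wheel was actually built with, since the pip wheel bundles its own CUDA/cuDNN runtime and may differ from the system cuDNN 7.6.4:

```python
import torch

print('torch          :', torch.__version__)
print('CUDA (torch)   :', torch.version.cuda)              # CUDA the wheel was built against
print('cuDNN (torch)  :', torch.backends.cudnn.version())  # cuDNN actually loaded by torch
print('CUDA available :', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU            :', torch.cuda.get_device_name(0))
    print('Capability     :', torch.cuda.get_device_capability(0))
```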

