
some bugs when training #1547

@wuzuiyuzui

Description


Hello, I ran into a difficult problem when using YOLOv5, and reinstalling the system did not help. I am very confused about what is happening, and it has been bothering me for several days. I closed my previous issue and am opening this one with the detailed error output; can you give me some help? I can run test and detect, but I cannot train.

🐛 Bug

Training crashes with a cuDNN error on the first backward pass; test and detect work fine.

To Reproduce (REQUIRED)

Input:

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)

Output:
Using torch 1.7.0+cu101 CUDA:0 (GeForce RTX 2080 Ti, 10997MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, log_imgs=16, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/exp10', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=7

             from  n    params  module                                  arguments                     

0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 32364 models.yolo.Detect [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7271276 parameters, 7271276 gradients

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning 'coco128/labels/train2017.cache' for images and labels... 3219 found, 0 missing, 20 empty, 0 corrupted: 100%|██████████| 3219/3219 [00:00<?, ?it/s]
Scanning 'coco128/labels/val.cache' for images and labels... 246 found, 2 missing, 0 empty, 0 corrupted: 100%|██████████| 248/248 [00:00<?, ?it/s]

Analyzing anchors... anchors/target = 4.42, Best Possible Recall (BPR) = 0.9894
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 300 epochs...

 Epoch   gpu_mem       box       obj       cls     total   targets  img_size

0%| | 0/202 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ljy/yolov5-master/train.py", line 492, in
train(hyp, opt, device, tb_writer, wandb)
File "/home/ljy/yolov5-master/train.py", line 293, in train
scaler.scale(loss).backward()
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ljy/anaconda3/envs/yolo/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 256, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f8ba4002b60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
output: TensorDescriptor 0x7f8ba40033a0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 256, 20, 20,
strideA = 102400, 400, 20, 1,
weight: FilterDescriptor 0x7f8ba403e080
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0x7f8a73b60000
output: 0x7f8c792e0000
weight: 0x7f8d5b660000

Process finished with exit code 1
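
To narrow this down, here is a small check I intend to run. It is my own sketch based on the PyTorch-generated snippet above (not part of PyTorch's output): it reruns the same failing 256→256 3×3 convolution with cuDNN autotuning toggled on/off and in both half and float32 precision, to see whether the crash is tied to `cudnn.benchmark` or to FP16 kernels specifically.

```python
import torch

def try_conv(dtype, benchmark):
    # Mirror the failing layer: 256->256 3x3 conv on a 16x256x20x20 batch.
    torch.backends.cudnn.benchmark = benchmark
    data = torch.randn(16, 256, 20, 20, dtype=dtype, device='cuda', requires_grad=True)
    net = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().to(dtype)
    out = net(data)
    out.backward(torch.randn_like(out))
    torch.cuda.synchronize()

for dtype in (torch.half, torch.float32):
    for benchmark in (False, True):
        try:
            try_conv(dtype, benchmark)
            print(f'OK    dtype={dtype}, cudnn.benchmark={benchmark}')
        except RuntimeError as err:
            print(f'ERROR dtype={dtype}, cudnn.benchmark={benchmark}: {err}')
```

If only the half-precision / benchmark=True combination fails, that would point at the cuDNN algorithm selection for FP16 rather than at the training code itself.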

Environment

  • OS: Ubuntu 20.04
  • GPU: GeForce RTX 2080 Ti
  • torch: 1.7.0+cu101
  • CUDA: 10.1
  • cuDNN: 7.6.4
  • NVIDIA driver: 440.95

  [Screenshots of the environment and error, taken 2020-11-28 16:06–16:09]
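
To double-check the versions above, the following sketch (my own addition; it only reads version info through standard torch APIs) prints what the installed torch wheel was actually built with, since the pip wheel bundles its own CUDA/cuDNN runtime and may differ from the system cuDNN 7.6.4:

```python
import torch

print('torch          :', torch.__version__)
print('CUDA (torch)   :', torch.version.cuda)              # CUDA the wheel was built against
print('cuDNN (torch)  :', torch.backends.cudnn.version())  # cuDNN actually loaded by torch
print('CUDA available :', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU            :', torch.cuda.get_device_name(0))
    print('Capability     :', torch.cuda.get_device_capability(0))
```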

