
AutoBatch: CUDA anomaly detected #9287

@alexk-ede

Description

Question

So I'm testing the AutoBatch feature, which is pretty cool.
It seemed to work fine last week, but this week, for whatever reason (maybe because it's Monday, who knows ...), I'm having issues with it.

I'm running yolov5s (latest git checkout, of course) and getting the following when using --batch -1.
The dataset is a slice of COCO.

AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7027720       6.744         2.414         27.87         35.49        (1, 3, 416, 416)                    list
     7027720       13.49         1.378         23.52         50.14        (2, 3, 416, 416)                    list
     7027720       26.98         1.380          23.8         56.75        (4, 3, 416, 416)                    list
     7027720       53.95         0.648         22.86         71.21        (8, 3, 416, 416)                    list
     7027720       107.9         1.330         26.38         91.88       (16, 3, 416, 416)                    list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 0.96G/7.79G (12%) ✅

Meanwhile, this is the nvtop output before running train.py, so there isn't really anything in GPU memory:

Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
 GPU 210MHz  MEM 405MHz  TEMP  53°C FAN  38% POW  19 / 220 W
 GPU[                                 0%] MEM[|                   0.208Gi/8.000Gi]

I am unsure about this line from AutoBatch:

7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free

The 2.20G reserved is weird, because I stopped everything (including gdm3), so nothing is running on the GPU besides the training process itself later.
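For reference, here is a minimal sketch (assuming these are the same PyTorch counters AutoBatch prints) of how to compare the per-process allocator numbers with the driver-level view of the device:

import torch

device = torch.device('cuda:0')
gb = 1 << 30

# Driver-level view of the whole device (includes other processes and the CUDA context)
free_driver, total = torch.cuda.mem_get_info(device)

# Caching-allocator view for *this* process only
reserved = torch.cuda.memory_reserved(device)
allocated = torch.cuda.memory_allocated(device)

print(f'total     {total / gb:.2f} GiB')
print(f'free      {free_driver / gb:.2f} GiB (driver)')
print(f'reserved  {reserved / gb:.2f} GiB (this process)')
print(f'allocated {allocated / gb:.2f} GiB (this process)')

If reserved already reads ~2.2G in a fresh interpreter, something inside the process is holding allocator memory; torch.cuda.empty_cache() releases cached-but-unused blocks back to the driver.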

And I can easily set batch to 80 and it works fine:

 Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 3@16x RX: 30.27 MiB/s TX: 8.789 MiB/s
 GPU 1905MHz MEM 6800MHz TEMP  68°C FAN  63% POW 199 / 220 W
 GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.319Gi/8.000Gi]
    PID USER DEV    TYPE  GPU        GPU MEM    CPU  HOST MEM Command
   6404 user   0 Compute  91%   7237MiB  88%   105%  14616MiB python train.py --img 416 --batch 80 --epochs 400  --cache --weights yolov5s.pt --data ...

I did the recommended environment restart, of course, and even rebooted the machine. AutoBatch still complains about roughly 2.20G reserved.

Any ideas on how I can investigate this?

My guess is that the 2.2GB of reserved memory messes up AutoBatch's interpolation, because the GPU_mem (GB) column doesn't make much sense (see the sketch after the table below):

  GPU_mem (GB)              input
         2.414   (1, 3, 416, 416)
         1.378   (2, 3, 416, 416)
         1.380   (4, 3, 416, 416)
         0.648   (8, 3, 416, 416)
         1.330  (16, 3, 416, 416)
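As a rough illustration of that guess, here is a minimal sketch of a first-degree fit over these profiled points (assumption: AutoBatch extrapolates GPU memory vs. batch size with a linear fit and targets a fraction of total memory; see utils/autobatch.py for the actual logic):

import numpy as np

batch_sizes = [1, 2, 4, 8, 16]
gpu_mem = [2.414, 1.378, 1.380, 0.648, 1.330]   # GB, the non-monotonic column above

# Fit memory = slope * batch + intercept, then solve for the batch size
# that would land at ~90% of total memory (the 0.9 fraction is an assumption here).
slope, intercept = np.polyfit(batch_sizes, gpu_mem, deg=1)
total, fraction = 7.79, 0.9
predicted = (total * fraction - intercept) / slope
print(f'slope={slope:.3f} GB/image, predicted batch size={predicted:.0f}')

With these readings the slope comes out negative, so the extrapolated batch size is nonsense, which would explain both the anomaly warning and the fallback to the default batch size of 16.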

Additional

  • Maybe the issue title should be changed to AutoBatch: CUDA anomaly detected

  • Some additional system info:

Ubuntu 22.04.1 LTS
Kernel Linux 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Mon Sep  5 16:22:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
  • Update:
    During training, it shows GPU memory usage of around
      Epoch    GPU_mem   ...
    112/399      5.79G

So I'm not sure where the rest went (i.e., the difference from the 7.2GB shown in nvtop) ...
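For what it's worth, the training log's GPU_mem column presumably reports PyTorch's reserved memory rather than total device usage; a quick way to see the allocator's own breakdown during training is the sketch below:

import torch

# Detailed caching-allocator report: reserved vs. allocated vs. inactive blocks.
# The gap between this and what nvtop shows for the process is typically the
# CUDA context plus cuDNN/NCCL workspaces, which the allocator does not track.
print(torch.cuda.memory_summary(device=0, abbreviated=True))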
