
AutoBatch: CUDA anomaly detected #9287

@alexk-ede

Description

Question

So I'm testing the AutoBatch feature, which is pretty cool.
It seemed to work fine last week, but this week, for whatever reason (maybe because it's Monday, who knows ...), I'm having issues with it.

I'm running yolov5s (latest git checkout, of course) and getting the following when using --batch -1.
The dataset is a slice of COCO.

AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7027720       6.744         2.414         27.87         35.49        (1, 3, 416, 416)                    list
     7027720       13.49         1.378         23.52         50.14        (2, 3, 416, 416)                    list
     7027720       26.98         1.380          23.8         56.75        (4, 3, 416, 416)                    list
     7027720       53.95         0.648         22.86         71.21        (8, 3, 416, 416)                    list
     7027720       107.9         1.330         26.38         91.88       (16, 3, 416, 416)                    list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 0.96G/7.79G (12%) ✅

Meanwhile, this is the nvtop output before running train.py, so there isn't really anything in GPU memory:

Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
 GPU 210MHz  MEM 405MHz  TEMP  53°C FAN  38% POW  19 / 220 W
 GPU[                                 0%] MEM[|                   0.208Gi/8.000Gi]

I am unsure about this line from AutoBatch:

7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free

The 2.20G reserved is weird, because I stopped everything (including gdm3), so nothing is running on the GPU besides the training process itself later.
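For reference, here is a minimal sketch (assuming these are the same PyTorch counters AutoBatch prints) of how to compare the per-process allocator numbers with the driver-level view of the device:

import torch

device = torch.device('cuda:0')
gb = 1 << 30

# Driver-level view of the whole device (includes other processes and the CUDA context)
free_driver, total = torch.cuda.mem_get_info(device)

# Caching-allocator view for *this* process only
reserved = torch.cuda.memory_reserved(device)
allocated = torch.cuda.memory_allocated(device)

print(f'total     {total / gb:.2f} GiB')
print(f'free      {free_driver / gb:.2f} GiB (driver)')
print(f'reserved  {reserved / gb:.2f} GiB (this process)')
print(f'allocated {allocated / gb:.2f} GiB (this process)')

If reserved already reads ~2.2G in a fresh interpreter, something inside the process is holding allocator memory; torch.cuda.empty_cache() releases cached-but-unused blocks back to the driver.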

And I can easily set batch to 80 and it works fine:

 Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 3@16x RX: 30.27 MiB/s TX: 8.789 MiB/s
 GPU 1905MHz MEM 6800MHz TEMP  68°C FAN  63% POW 199 / 220 W
 GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.319Gi/8.000Gi]
    PID USER DEV    TYPE  GPU        GPU MEM    CPU  HOST MEM Command
   6404 user   0 Compute  91%   7237MiB  88%   105%  14616MiB python train.py --img 416 --batch 80 --epochs 400  --cache --weights yolov5s.pt --data ...

I did the recommended environment restart, of course, and even rebooted the machine. AutoBatch still complains about roughly 2.20G reserved.

Any ideas on how I can investigate this?

My guess is that the 2.2GB of reserved memory messes up AutoBatch's interpolation, because the GPU_mem (GB) column doesn't make much sense (see the sketch after the table below):

  GPU_mem (GB)              input
         2.414   (1, 3, 416, 416)
         1.378   (2, 3, 416, 416)
         1.380   (4, 3, 416, 416)
         0.648   (8, 3, 416, 416)
         1.330  (16, 3, 416, 416)
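As a rough illustration of that guess, here is a minimal sketch of a first-degree fit over these profiled points (assumption: AutoBatch extrapolates GPU memory vs. batch size with a linear fit and targets a fraction of total memory; see utils/autobatch.py for the actual logic):

import numpy as np

batch_sizes = [1, 2, 4, 8, 16]
gpu_mem = [2.414, 1.378, 1.380, 0.648, 1.330]   # GB, the non-monotonic column above

# Fit memory = slope * batch + intercept, then solve for the batch size
# that would land at ~90% of total memory (the 0.9 fraction is an assumption here).
slope, intercept = np.polyfit(batch_sizes, gpu_mem, deg=1)
total, fraction = 7.79, 0.9
predicted = (total * fraction - intercept) / slope
print(f'slope={slope:.3f} GB/image, predicted batch size={predicted:.0f}')

With these readings the slope comes out negative, so the extrapolated batch size is nonsense, which would explain both the anomaly warning and the fallback to the default batch size of 16.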

Additional

  • Maybe the issue title should be changed to AutoBatch: CUDA anomaly detected

  • Some additional system info:

Ubuntu 22.04.1 LTS
Kernel Linux 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Mon Sep  5 16:22:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
  • Update:
    During training, it shows GPU memory usage of around
      Epoch    GPU_mem   ...
    112/399      5.79G

So I'm not sure where the rest went (i.e., the difference from the 7.2GB shown in nvtop) ...
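For what it's worth, the training log's GPU_mem column presumably reports PyTorch's reserved memory rather than total device usage; a quick way to see the allocator's own breakdown during training is the sketch below:

import torch

# Detailed caching-allocator report: reserved vs. allocated vs. inactive blocks.
# The gap between this and what nvtop shows for the process is typically the
# CUDA context plus cuDNN/NCCL workspaces, which the allocator does not track.
print(torch.cuda.memory_summary(device=0, abbreviated=True))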
