Description
Search before asking
- I have searched the YOLOv5 issues and discussions and found no similar questions.
Question
I'm testing the AutoBatch feature, which is pretty cool.
It seemed to work fine last week, but this week, for whatever reason (maybe because it's Monday, who knows ...), I'm having issues with it.
I'm running yolov5s (latest git checkout, of course) and getting the following output when using --batch -1.
The dataset is a slice of COCO.
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free
     Params   GFLOPs   GPU_mem (GB)   forward (ms)   backward (ms)              input   output
    7027720    6.744          2.414          27.87           35.49   (1, 3, 416, 416)     list
    7027720    13.49          1.378          23.52           50.14   (2, 3, 416, 416)     list
    7027720    26.98          1.380          23.8            56.75   (4, 3, 416, 416)     list
    7027720    53.95          0.648          22.86           71.21   (8, 3, 416, 416)     list
    7027720    107.9          1.330          26.38           91.88  (16, 3, 416, 416)     list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 0.96G/7.79G (12%) ✅
Meanwhile, this is the nvtop output before running train.py, so there is essentially nothing in GPU memory:
Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
GPU 210MHz MEM 405MHz TEMP 53°C FAN 38% POW 19 / 220 W
GPU[ 0%] MEM[| 0.208Gi/8.000Gi]
I am unsure about this line from AutoBatch:
7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free
The 2.20G reserved is odd, because I stopped everything (including gdm3), so nothing else is running on the GPU besides the training process itself.
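For reference, this is a minimal sketch (my own check, not YOLOv5 code) of how I reproduce the numbers in that line; as far as I understand, the reserved value comes from PyTorch's caching allocator inside the current process, so it cannot be caused by other processes on the GPU:

```python
# Minimal sketch: query the same per-process numbers AutoBatch prints
# (assumption: AutoBatch derives them from these torch.cuda calls).
import torch

gb = 1 << 30
device = torch.device('cuda:0')

t = torch.cuda.get_device_properties(device).total_memory / gb  # total VRAM
r = torch.cuda.memory_reserved(device) / gb                     # cached by PyTorch's allocator (this process only)
a = torch.cuda.memory_allocated(device) / gb                    # occupied by live tensors
print(f'{t:.2f}G total, {r:.2f}G reserved, {a:.2f}G allocated, {t - (r + a):.2f}G free')
```

In a fresh interpreter both reserved and allocated start at zero, so the 2.20G presumably accumulates inside train.py before AutoBatch runs (model load, --cache, etc.), but I may be wrong about that.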
On the other hand, I can manually set --batch 80 and it trains fine:
Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 3@16x RX: 30.27 MiB/s TX: 8.789 MiB/s
GPU 1905MHz MEM 6800MHz TEMP 68°C FAN 63% POW 199 / 220 W
GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.319Gi/8.000Gi]
PID USER DEV TYPE GPU GPU MEM CPU HOST MEM Command
6404 user 0 Compute 91% 7237MiB 88% 105% 14616MiB python train.py --img 416 --batch 80 --epochs 400 --cache --weights yolov5s.pt --data ...
I did the recommended environment restart and even rebooted the machine, but AutoBatch still reports around 2.20G reserved.
Any ideas how I can investigate this?
My guess is that the 2.2 GB of reserved memory throws off AutoBatch's interpolation, because the GPU_mem (GB) column doesn't make much sense:
GPU_mem (GB) input
2.414 (1, 3, 416, 416)
1.378 (2, 3, 416, 416)
1.380 (4, 3, 416, 416)
0.648 (8, 3, 416, 416)
1.330 (16, 3, 416, 416)
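To illustrate why I think the fit breaks: as far as I understand, AutoBatch fits a straight line through these (batch size, GPU_mem) points and solves it for a target fraction of the free memory. Here is a simplified sketch using my measured values (not the actual utils/autobatch.py code; the 0.8 fraction is an assumption):

```python
# Simplified sketch of the AutoBatch estimation step as I understand it,
# using the values from my log above (not the actual utils/autobatch.py code).
import numpy as np

total, reserved, allocated = 7.79, 2.20, 0.05        # GB, from the AutoBatch log
free = total - (reserved + allocated)                # 5.54 GB, treated as usable
fraction = 0.8                                       # assumed target fraction of free memory

batch_sizes = np.array([1, 2, 4, 8, 16])
gpu_mem = np.array([2.414, 1.378, 1.380, 0.648, 1.330])     # my GPU_mem (GB) column

slope, intercept = np.polyfit(batch_sizes, gpu_mem, deg=1)  # mem ≈ slope * b + intercept
b = int((free * fraction - intercept) / slope)              # batch size that would hit the target
print(f'slope={slope:.3f}, intercept={intercept:.3f}, batch={b}')
# With these numbers the slope is negative and b comes out negative,
# which is presumably what triggers the CUDA-anomaly warning and the fallback to 16.
```

With a sane profile, GPU_mem should grow roughly linearly with batch size; the ~2.4 GB measurement at batch 1 and the non-monotonic values above make the linear fit meaningless.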
Additional
- Maybe the issue title should be changed to "AutoBatch: CUDA anomaly detected".
- Some additional system info:
Ubuntu 22.04.1 LTS
Kernel Linux 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Mon Sep 5 16:22:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
Update:
During training, the reported GPU memory usage is around:
Epoch    GPU_mem  ...
112/399  5.79G
So I'm not sure where the rest went (i.e., the difference to the ~7.2 GB shown in nvtop) ...
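My current understanding (possibly wrong) is that the GPU_mem column in train.py reflects PyTorch's caching allocator, while nvtop shows the full per-process footprint including the CUDA context and cuDNN/cuBLAS workspaces, which PyTorch does not count. This is a sketch of how I'd compare the two from inside the training process (e.g. at a breakpoint):

```python
# Sketch of a debugging check, run inside the training process:
# compare PyTorch's view of GPU memory with the driver's view.
import torch

gb = 1 << 30
device = torch.device('cuda:0')

allocated = torch.cuda.memory_allocated(device) / gb   # live tensors only
reserved = torch.cuda.memory_reserved(device) / gb     # allocator cache (what the GPU_mem column seems to show)
free, total = (x / gb for x in torch.cuda.mem_get_info(device))
driver_used = total - free                              # what nvtop/nvidia-smi see: context, workspaces, cache, other processes

print(f'allocated {allocated:.2f}G | reserved {reserved:.2f}G | driver-used {driver_used:.2f}G')
print(torch.cuda.memory_summary(device, abbreviated=True))  # detailed caching-allocator breakdown
```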