Problem on running Hyperparameter Evolution on Big Dataset

### Search before asking

- [X] I have searched the YOLOv5 [issues](https://github.com/ultralytics/yolov5/issues) and found no similar bug report.


### YOLOv5 Component

Evolution

### Bug

Good morning,

I am trying to run Hyperparameter Evolution on a relatively big dataset (>1 million images, >200 GB), but I am running into issues.

The commands I am using to run it are quite simple:

python train.py --data my_dataset.yaml --weights 'yolov5s6.pt' --cfg yolov5s.yaml --batch 32 --img 1280 --epochs 1 --evolve 25

python train.py --data my_dataset.yaml --weights 'yolov5s6.pt' --cfg yolov5s.yaml --batch 32 --img 1280 --epochs 2 --evolve 12
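To give a rough idea of the workload, here is a minimal sketch of the arithmetic. It assumes each evolution generation is one full training run of `--epochs` epochs over the whole dataset (that is my understanding of `--evolve`), and it uses 1 million images as a lower bound. The function name is just for illustration.

```python
def images_processed(num_images: int, epochs: int, generations: int) -> int:
    """Training images seen in total, assuming every generation
    trains for `epochs` full passes over the dataset."""
    return num_images * epochs * generations


NUM_IMAGES = 1_000_000  # lower bound; the dataset has more than 1 million images

# (epochs, generations) for the two commands above
runs = {
    "--epochs 1 --evolve 25": (1, 25),
    "--epochs 2 --evolve 12": (2, 12),
}

for flags, (epochs, generations) in runs.items():
    total = images_processed(NUM_IMAGES, epochs, generations)
    print(f"{flags}: at least {total:,} training images processed")
```

Both commands therefore amount to roughly the same training budget, split into a different number of generations. Validation passes are not counted here.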

Some of the errors that happen:

1. Training runs through all the epochs of a given generation, then crashes during validation.

The errors that appear are:

AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
Exception ignored in: <function StorageWeakRef.__del__ at 0x2b6fee3035e0>

AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=22501303.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

However, when I checked, the batches were occupying only 22 GB of the 32 GB of memory.

2. Sometimes a generation runs properly, but then the process just stops working at the model summary screen.
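Regarding the first error: as far as I understand, the `slurmstepd` oom-kill message refers to the job's cgroup limit on host (CPU) RAM, not to GPU memory, so the 22 GB figure may not be the relevant one. Below is a minimal sketch, assuming a Linux or macOS node, for logging the peak resident memory of a Python process using only the standard library. The function name is hypothetical and this is not part of YOLOv5.

```python
import resource
import sys


def peak_rss_gib() -> float:
    """Peak resident set size of the calling process, in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return peak / divisor


print(f"peak host memory of this process: {peak_rss_gib():.2f} GiB")
```

This only measures the process that calls it. Dataloader workers run as separate processes and count against the same cgroup limit, so the true total for the job can be considerably higher than this number.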


### Environment

- Python 3.9 (also tried 3.8)
- GPU: V100 32 GB
- System memory: 48 GB allocated for the CPUs
- CUDA 11.1
- Torch 1.8
- Torchvision 0.9

Also tried: CUDA 10.2, Torch 1.11, Torchvision 0.12

Both setups worked well in all other applications, even for evolution on smaller datasets.
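In case it helps to compare the two setups, here is a small sketch that prints the Python version and the installed versions of the relevant packages, using only the standard library (Python 3.8+). The package names passed in are the usual distribution names and the helper name is just for illustration.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version,
    or to 'not installed' if it cannot be found."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print("python", platform.python_version())
for name, version in package_versions(["torch", "torchvision"]).items():
    print(name, version)
```

The CUDA version is not covered by this sketch, since it is not a Python package.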

### Minimal Reproducible Example

_No response_

### Additional

_No response_

### Are you willing to submit a PR?

- [ ] Yes I'd like to help by submitting a PR!

Problem on running Hyperparameter Evolution on Big Dataset #9916

Search before asking

YOLOv5 Component

Bug

Environment

Minimal Reproducible Example

Additional

Are you willing to submit a PR?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Problem on running Hyperparameter Evolution on Big Dataset #9916

Description

Search before asking

YOLOv5 Component

Bug

Environment

Minimal Reproducible Example

Additional

Are you willing to submit a PR?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions