Problem on running Hyperparameter Evolution on Big Dataset #9916

@silvada95

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Evolution

Bug

Good morning,

I am trying to run Hyperparameter Evolution on a relatively big dataset (>1 million images, >200 GB), but it keeps failing.

The commands I am using are quite simple:

python train.py --data my_dataset.yaml --weights 'yolov5s6.pt' --cfg yolov5s.yaml --batch 32 --img 1280 --epochs 1 --evolve 25

python train.py --data my_dataset.yaml --weights 'yolov5s6.pt' --cfg yolov5s.yaml --batch 32 --img 1280 --epochs 2 --evolve 12

Some of the failures I see:

1. Training runs through all the epochs of a given generation, then crashes during validation.

The errors that appear are:

AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
Exception ignored in: <function StorageWeakRef.__del__ at 0x2b6fee3035e0>

AttributeError: 'NoneType' object has no attribute '_free_weak_ref'
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=22501303.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

However, when I checked, the batches were occupying only 22 GB of the 32 GB of memory.
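The `slurmstepd` message comes from the cgroup out-of-memory handler, which, as far as can be told from the log, concerns host (CPU) RAM rather than GPU memory, so a 22 GB figure measured against the 32 GB GPU would not rule it out. A minimal sketch for logging the host-side peak of the current process is below. It uses only the standard library; the function name `peak_rss_gib` is hypothetical and not part of YOLOv5, and dataloader worker processes are separate processes that this number does not include.

```python
import resource
import sys


def peak_rss_gib() -> float:
    """Return the peak resident set size of the current process in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical usage: call this at the end of each epoch or generation
    # and compare the value with the memory requested from the scheduler.
    print(f"peak host RSS: {peak_rss_gib():.2f} GiB")
```

Logging this value once per generation would show whether host memory grows from one generation to the next before the kill occurs.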

2. Sometimes a generation runs properly, but then the process simply hangs at the model summary screen.

Environment

  • Python 3.9 (also tried 3.8)
  • GPU: V100 32 GB
  • System memory: 48 GB allocated for the CPUs
  • CUDA 11.1
  • Torch 1.8
  • Torchvision 0.9

Also tried: CUDA 10.2, Torch 1.11, Torchvision 0.12

Both setups worked well in all other applications, including evolution on smaller datasets.
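Since two setups are listed, it may help to record the versions from inside the same job that crashes. The following is a small sketch under that assumption; `environment_report` is a hypothetical helper, and it only reads attributes that PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`), falling back gracefully when a package is missing.

```python
import importlib
import platform


def environment_report() -> dict:
    """Collect Python, Torch, and Torchvision versions as strings."""
    report = {"python": platform.python_version()}
    for name in ("torch", "torchvision"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
            continue
        report[name] = getattr(module, "__version__", "unknown")
        if name == "torch":
            # CUDA version the wheel was built against (None for CPU builds).
            report["cuda_build"] = str(module.version.cuda)
            report["cuda_available"] = str(module.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```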

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Labels

  • Stale
  • bug
