
Error when resuming training #7394

@Pharaun85

Description


Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Due to GPU time restrictions, I train until the session cuts off and then relaunch the training with the --resume option.
Until yesterday this worked perfectly, but today I get the error below regardless of which checkpoint I resume from. I have tried some old checkpoints that I know worked before, and the error is the same. Has anything changed in the optimizer structure?

Environment

Using torch 1.10.0+cu111 (Tesla K80)

Google Colab

Minimal Reproducible Example

!python train.py --img 1280 --batch 16 --epochs 50 --data /content/drive/MyDrive/OIv6/dataset.yaml --project /content/drive/MyDrive/OIv6/runs/train --weights yolov5s6.pt --hyp hyp.VOC.yaml --optimizer AdamW --device 0

!python train.py --resume /content/drive/MyDrive/OIv6/runs/train/exp/weights/last.pt --device 0

Traceback (most recent call last):
  File "train.py", line 667, in <module>
    main(opt)
  File "train.py", line 562, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 191, in train
    optimizer.load_state_dict(ckpt['optimizer'])
  File "/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py", line 146, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
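For context, PyTorch's `Optimizer.load_state_dict` raises this ValueError whenever a parameter group in the checkpoint holds a different number of tensors than the corresponding group in the freshly built optimizer. The sketch below is a hypothetical minimal reproduction (not YOLOv5 code), assuming only stock `torch.optim.AdamW`: the saved optimizer's single group has 4 parameters, the new one has 5.

```python
import torch

# Hypothetical standalone reproduction of the resume error: two optimizers
# whose parameter groups contain different numbers of tensors.
old_params = [torch.nn.Parameter(torch.zeros(3)) for _ in range(4)]
new_params = [torch.nn.Parameter(torch.zeros(3)) for _ in range(5)]

# Checkpoint saved when the (single) group held 4 parameters...
opt_old = torch.optim.AdamW(old_params, lr=1e-3)
ckpt = opt_old.state_dict()

# ...cannot be loaded into an optimizer whose group now holds 5.
opt_new = torch.optim.AdamW(new_params, lr=1e-3)
try:
    opt_new.load_state_dict(ckpt)
except ValueError as e:
    print(e)  # parameter group size mismatch, as in the traceback above
```

This suggests the failure is not about the checkpoint file itself but about how train.py partitions model parameters into optimizer groups: if the grouping logic changed between the run that saved `last.pt` and the code doing the resume, the group sizes no longer line up.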

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Metadata

Labels

bug — Something isn't working