
[Feature Request] Add Support for Resuming Training from Checkpoints #150

@LJH-cloud

Description

Required prerequisites

Motivation

Currently, the SupervisedTrainer in align_anything/trainer/text_to_text/sft.py and its subclass in align_anything/trainer/text_image_to_text/sft.py lack built-in support for resuming training from a saved checkpoint. The --save_interval parameter saves checkpoints periodically (e.g., checkpoint-1000), but there is no straightforward way to resume training from them, i.e., to restore the model weights, optimizer state, and global_step. This feature would greatly improve usability for long-running training jobs that may need to be interrupted and resumed.

Solution

I’d like to request a feature to resume training seamlessly from a checkpoint by:

  1. Adding a command-line argument (e.g., --resume_from_checkpoint) to specify the checkpoint path (e.g., /path/to/checkpoint-1000).
  2. Loading the saved model weights into the DeepSpeed engine.
  3. Optionally restoring the optimizer state and global_step (if saved) to continue training from the exact point of interruption.
  4. Integrating this functionality into the existing training pipeline without requiring manual code changes (a rough sketch is given below).
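
As a minimal sketch of what this could look like, assuming the trainer already holds an initialized DeepSpeed engine: the flag name --resume_from_checkpoint, the maybe_resume helper, and the global_step key in client_state are hypothetical and would need to follow align_anything's existing argument/config conventions, while DeepSpeed's engine.load_checkpoint / save_checkpoint calls are the standard API for restoring module weights, optimizer, and LR-scheduler states.

```python
# Hypothetical sketch only; names should follow align_anything's existing conventions.
import argparse
import os
from typing import Optional


def add_resume_arg(parser: argparse.ArgumentParser) -> None:
    # Proposed CLI flag pointing at a DeepSpeed checkpoint directory.
    parser.add_argument(
        '--resume_from_checkpoint',
        type=str,
        default=None,
        help='Path to a checkpoint directory, e.g. /path/to/checkpoint-1000',
    )


def maybe_resume(engine, ckpt_dir: Optional[str]) -> int:
    """Restore model/optimizer state from a DeepSpeed checkpoint.

    Returns the restored global_step, or 0 when starting from scratch.
    """
    if ckpt_dir is None:
        return 0
    if not os.path.isdir(ckpt_dir):
        raise FileNotFoundError(f'Checkpoint directory not found: {ckpt_dir}')
    # load_checkpoint restores module weights plus (optionally) optimizer and
    # LR-scheduler states; client_state is whatever dict was passed to
    # save_checkpoint when the checkpoint was written.
    load_path, client_state = engine.load_checkpoint(
        ckpt_dir,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )
    if load_path is None:
        raise RuntimeError(f'No DeepSpeed checkpoint found under {ckpt_dir}')
    return int(client_state.get('global_step', 0)) if client_state else 0
```

For this to round-trip, the existing save path would also need to persist the step counter, e.g. engine.save_checkpoint(save_dir, client_state={'global_step': self.global_step}), and the training loop would then fast-forward or skip the already-consumed steps after calling maybe_resume.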

Alternatives

No response

Additional context

No response

Labels: enhancement (New feature or request)