Labels: enhancement (New feature or request)
Description
Required prerequisites
- I have searched the Issue Tracker and Discussions and confirmed that this hasn't already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Motivation
Currently, the SupervisedTrainer in align_anything/trainer/text_to_text/sft.py and its subclass in align_anything/trainer/text_image_to_text/sft.py lack built-in support for resuming training from a saved checkpoint. The --save_interval parameter works well for saving checkpoints periodically (e.g., checkpoint-1000), but there is no straightforward way to resume training from these checkpoints, i.e., to restore the model weights, optimizer state, and global_step. This feature would greatly improve usability for long-running training jobs that may need to be interrupted and resumed.
Solution
I'd like to request a feature to resume training seamlessly from a checkpoint (a rough sketch is given after this list) by:
- Adding a command-line argument (e.g., --resume_from_checkpoint) to specify the checkpoint path (e.g., /path/to/checkpoint-1000).
- Loading the saved model weights into the DeepSpeed engine.
- Optionally restoring the optimizer state and global_step (if saved) to continue training from the exact point of interruption.
- Integrating this functionality into the existing training pipeline without requiring manual code changes.
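
A minimal sketch of what the resume path could look like, assuming the periodic checkpoints are (or could be) written with DeepSpeed's native save_checkpoint API; the function name and the way the engine and checkpoint path are passed in are illustrative, not the existing align-anything code:

```python
import deepspeed


def resume_from_checkpoint(engine: deepspeed.DeepSpeedEngine, ckpt_dir: str) -> int:
    """Load a DeepSpeed checkpoint and return the global step to resume from."""
    # load_checkpoint restores module weights and, by default, the optimizer
    # and LR-scheduler states; client_state carries any user metadata (such as
    # global_step) that was passed to save_checkpoint when the checkpoint was
    # written.
    load_path, client_state = engine.load_checkpoint(
        ckpt_dir,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )
    if load_path is None:
        raise FileNotFoundError(f'No DeepSpeed checkpoint found in {ckpt_dir}')
    return int(client_state.get('global_step', 0)) if client_state else 0
```

For global_step to be recoverable this way, the periodic save driven by --save_interval would also need to record it, e.g. engine.save_checkpoint(save_dir, client_state={'global_step': global_step}), and on restart the training loop would have to skip the batches already consumed in the interrupted epoch.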
Alternatives
No response
Additional context
No response