
[Feature Request] Add Support for Resuming Training from Checkpoints #150

@LJH-cloud

Description

Required prerequisites

Motivation

Currently, the SupervisedTrainer in align_anything/trainer/text_to_text/sft.py and its subclass in align_anything/trainer/text_image_to_text/sft.py lack built-in support for resuming training from a saved checkpoint. The --save_interval parameter saves checkpoints periodically (e.g., checkpoint-1000), but there is no straightforward way to resume training from them, i.e., to restore the model weights, optimizer state, and global_step. This feature would greatly improve usability for long-running training jobs that may need to be interrupted and resumed.

Solution

I’d like to request a feature to resume training seamlessly from a checkpoint by:

  1. Adding a command-line argument (e.g., --resume_from_checkpoint) to specify the checkpoint path (e.g., /path/to/checkpoint-1000).
  2. Loading the saved model weights into the DeepSpeed engine.
  3. Optionally restoring the optimizer state and global_step (if saved) to continue training from the exact point of interruption.
  4. Integrating this functionality into the existing training pipeline without requiring manual code changes (a rough sketch is given below).
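
As a minimal sketch of what this could look like, assuming the trainer already holds an initialized DeepSpeed engine: the flag name --resume_from_checkpoint, the maybe_resume helper, and the global_step key in client_state are hypothetical and would need to follow align_anything's existing argument/config conventions, while DeepSpeed's engine.load_checkpoint / save_checkpoint calls are the standard API for restoring module weights, optimizer, and LR-scheduler states.

```python
# Hypothetical sketch only; names should follow align_anything's existing conventions.
import argparse
import os
from typing import Optional


def add_resume_arg(parser: argparse.ArgumentParser) -> None:
    # Proposed CLI flag pointing at a DeepSpeed checkpoint directory.
    parser.add_argument(
        '--resume_from_checkpoint',
        type=str,
        default=None,
        help='Path to a checkpoint directory, e.g. /path/to/checkpoint-1000',
    )


def maybe_resume(engine, ckpt_dir: Optional[str]) -> int:
    """Restore model/optimizer state from a DeepSpeed checkpoint.

    Returns the restored global_step, or 0 when starting from scratch.
    """
    if ckpt_dir is None:
        return 0
    if not os.path.isdir(ckpt_dir):
        raise FileNotFoundError(f'Checkpoint directory not found: {ckpt_dir}')
    # load_checkpoint restores module weights plus (optionally) optimizer and
    # LR-scheduler states; client_state is whatever dict was passed to
    # save_checkpoint when the checkpoint was written.
    load_path, client_state = engine.load_checkpoint(
        ckpt_dir,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )
    if load_path is None:
        raise RuntimeError(f'No DeepSpeed checkpoint found under {ckpt_dir}')
    return int(client_state.get('global_step', 0)) if client_state else 0
```

For this to round-trip, the existing save path would also need to persist the step counter, e.g. engine.save_checkpoint(save_dir, client_state={'global_step': self.global_step}), and the training loop would then fast-forward or skip the already-consumed steps after calling maybe_resume.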

Alternatives

No response

Additional context

No response

Labels: enhancement (New feature or request)