@tjruwase
Contributor

  • FastPersist
  • ZeRO-Inference+SGLang

Requires deepspeedai/DeepSpeed#7215

tjruwase and others added 30 commits December 30, 2021 10:35
* Add checkpoint comparison

* Corrected a typo

Co-authored-by: Yang Li <[email protected]>
…ft/DeepSpeedExamples-internal into olruwase/fast_model_checkpoint
* save_checkpoint perf monitoring

* Disable checkpoint save on exit
…ft/DeepSpeedExamples-internal into staging-fast-model-checkpoint-v2
…ft/DeepSpeedExamples-internal into olruwase/fast_model_checkpoint
* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* local rank arg
* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* local rank arg

* Single writer option
…ft/DeepSpeedExamples-internal into olruwase/fast_model_checkpoint
tjruwase and others added 20 commits February 12, 2025 11:59
Signed-off-by: Olatunji Ruwase <[email protected]>
* Fast model checkpointing

* Support both legacy and serialized formats

* Add io_buffer_mb option

* Bug fix

* Force flush

* More model options; Refactor common codes

* --gpu option

* --half and more flexible options

* Add deepspeed.save_checkpoint()

* Free ds memory

* Improve repro

* Double I/O buffer (#56)

* Double I/O buffer (#60)

* Add checkpoint comparison (#62)

* Add checkpoint comparison

* Corrected a typo

Co-authored-by: Yang Li <[email protected]>

* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* Perf statistics for save_checkpoint (#64)

* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* add logs for a100-80

* add torch* error log with half flag but without fused flag

* log for error

* local rank arg

* Handle local_rank arg (#78)

* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* local rank arg

* Single writer option

* Single writer option (#79)

* save_checkpoint perf monitoring

* Disable checkpoint save on exit

* local rank arg

* Single writer option

* Allow missing folder

* DP writer refactor

* Update for DS; Add GDS

Signed-off-by: Olatunji Ruwase <[email protected]>

* Integrate GDS into deepspeed_model_save

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: jerryyangli <[email protected]>
Co-authored-by: Yang Li <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
…dExamples-internal into olruwase/fast_persist
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
@tjruwase tjruwase merged commit 207c93c into master Jun 9, 2025
2 checks passed
7 participants