@Antlera Antlera commented Jul 3, 2025

Description:

This PR introduces scripts for benchmarking and fine-tuning with ZenFlow (a rough usage sketch follows the list):

  • zf_benchmark.py: Benchmark script for evaluating offloading performance (adapted from offload_states.py by @tohtana).
  • output_table.py: Parses and summarizes benchmark logs.
  • run_benchmark.sh: Automates benchmark runs with configurable parameters.
  • finetune_llama.py: Fine-tuning script for Llama-2 with DeepSpeed + ZenFlow.
  • finetune_llama.sh: Launch script for fine-tuning with environment setup.
  • zf_config.json: Example DeepSpeed config with ZenFlow optimizations.
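
The sketch below is only for orientation: it is a hypothetical invocation of the scripts above, and the exact flags and argument names live in the scripts themselves and may differ.

```bash
# Hypothetical usage sketch; consult the scripts in this PR for the real flags.
bash run_benchmark.sh                  # sweep the offloading benchmark configurations
python output_table.py benchmark_logs  # summarize the logs (the log-path argument is an assumption)
bash finetune_llama.sh                 # launch Llama-2 fine-tuning with DeepSpeed + ZenFlow (zf_config.json)
```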

Note: This PR is complementary to PR #7391 on the main repo and should be merged together with, or after, PR #7391.

@Antlera Antlera requested a review from tjruwase as a code owner July 3, 2025 01:23
Antlera and others added 2 commits July 2, 2025 21:27
- Introduced `zf_benchmark.py` for model offloading benchmarking with DeepSpeed.
- Added `output_table.py` to parse and display benchmark results in a tabular format.
- Created `run_benchmark.sh` to automate benchmark runs with various configurations.

Signed-off-by: Tingfeng Lan <[email protected]>
- Introduced `finetune_llama.py` for fine-tuning the Llama-2 model using DeepSpeed and ZenFlow.
- Added `finetune_llama.sh` for automated training setup with environment variables and DeepSpeed command.
- Added `zf_config.json` example for DeepSpeed configuration with ZenFlow optimizations.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
@Antlera Antlera force-pushed the zenflow_z1_2_example branch from ca441f5 to 0528aed on July 3, 2025 01:27
Antlera commented Jul 3, 2025

Hi, @tohtana @tjruwase. Could you help review this PR when you have time? This is the example PR for PR #7391. Thanks!

Antlera commented Aug 4, 2025

@sfc-gh-truwase Thanks for the great suggestions — I’ve applied them all!

delock commented Aug 6, 2025

Hi @Antlera, I have a question. I saw that ZenFlow runs the parameter update on the CPU. Does the DeepSpeed argument --bind_cores_to_rank help with CPU optimizer efficiency? If so, maybe this switch could be added to the launch script.

Here is a link describing this switch. It is intended for the CPU backend, but it should help with CPU offload as well.
https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/accelerator-setup-guide.md#how-to-launch-deepspeed-on-intel-architecture-cpu

Antlera commented Aug 7, 2025

Hi @delock. Thank you for bringing this up; this is a great observation! I just tested --bind_cores_to_rank on an Intel Xeon Gold 6330 (112 threads) with 2 × L4 GPUs and did see a tangible benefit of about 60–80 ms per step in the CPU-optimizer phase when I bound cores as 0-53,54-107. The key, however, is leaving enough cores free for background services; if too few are reserved (e.g., only one core left for background applications), the gains quickly flatten or even reverse.

ZenFlow currently sidesteps most contention by evenly sharding os.cpu_count() cores across ranks, but a smarter policy that explicitly reserves background cores, and perhaps factors in NUMA locality, could push things further. I can open an issue to track this as a potential enhancement, and I would welcome any best-practice core-binding patterns you've found effective.
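
As a concrete illustration, an explicit binding along those lines might look like the sketch below. The core ranges match the two-rank split mentioned above and would need adjusting for a different CPU/NUMA topology; --bind_core_list is, to my understanding, the companion launcher option for specifying explicit core ranges, so please treat this as an assumption rather than a tested command.

```bash
# Illustrative sketch: bind rank 0 to cores 0-53 and rank 1 to cores 54-107,
# leaving cores 108-111 free for background services. Adjust the ranges to
# your own topology; the training-script arguments are placeholders.
deepspeed --bind_cores_to_rank --bind_core_list 0-53,54-107 \
  finetune_llama.py --deepspeed_config zf_config.json
```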

I’ve shared more detailed discussion and logs in the email thread that Tunji forwarded—please let me know if you didn’t receive them.

delock commented Aug 8, 2025

Yes, please open an issue so we can discuss this in detail. And yes, leaving one core might not be enough if background services need more. We can also discuss how to make core-binding tuning easier to use.

sfc-gh-truwase added a commit to deepspeedai/DeepSpeed that referenced this pull request Aug 10, 2025
This PR adds a blog post and images for ZenFlow, introducing its design,
benefits, and usage. The blog explains how ZenFlow improves GPU
utilization by overlapping computation and communication during
offloaded training.

See also: 
#7391 – core ZenFlow implementation.
[#982](deepspeedai/DeepSpeedExamples#982) – benchmarking and fine-tuning example.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
@sfc-gh-truwase sfc-gh-truwase merged commit b4385e5 into deepspeedai:master Aug 16, 2025
2 checks passed
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025