@Antlera Antlera commented Jul 3, 2025

Description:

This PR introduces scripts for benchmarking and fine-tuning with ZenFlow (a rough usage sketch follows the list):

  • zf_benchmark.py: Benchmark script for evaluating offloading performance (adapted from offload_states.py by @tohtana).
  • output_table.py: Parses and summarizes benchmark logs.
  • run_benchmark.sh: Automates benchmark runs with configurable parameters.
  • finetune_llama.py: Fine-tuning script for Llama-2 with DeepSpeed + ZenFlow.
  • finetune_llama.sh: Launch script for fine-tuning with environment setup.
  • zf_config.json: Example DeepSpeed config with ZenFlow optimizations.
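
The sketch below is only for orientation: it is a hypothetical invocation of the scripts above, and the exact flags and argument names live in the scripts themselves and may differ.

```bash
# Hypothetical usage sketch; consult the scripts in this PR for the real flags.
bash run_benchmark.sh                  # sweep the offloading benchmark configurations
python output_table.py benchmark_logs  # summarize the logs (the log-path argument is an assumption)
bash finetune_llama.sh                 # launch Llama-2 fine-tuning with DeepSpeed + ZenFlow (zf_config.json)
```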

Note: This PR is complementary to PR #7391 on the main repo and should be merged together with, or after, PR #7391.

@Antlera Antlera requested a review from tjruwase as a code owner July 3, 2025 01:23
Antlera and others added 2 commits July 2, 2025 21:27
- Introduced `zf_benchmark.py` for model offloading benchmarking with DeepSpeed.
- Added `output_table.py` to parse and display benchmark results in a tabular format.
- Created `run_benchmark.sh` to automate benchmark runs with various configurations.

Signed-off-by: Tingfeng Lan <[email protected]>
- Introduced `finetune_llama.py` for fine-tuning the Llama-2 model using DeepSpeed and ZenFlow.
- Added `finetune_llama.sh` for automated training setup with environment variables and DeepSpeed command.
- Added `zf_config.json` example for DeepSpeed configuration with ZenFlow optimizations.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
@Antlera Antlera force-pushed the zenflow_z1_2_example branch from ca441f5 to 0528aed on July 3, 2025 01:27
Antlera commented Jul 3, 2025

Hi, @tohtana @tjruwase. Could you help review this PR when you have time? This is the example PR for PR #7391. Thanks!

Antlera commented Aug 4, 2025

@sfc-gh-truwase Thanks for the great suggestions — I’ve applied them all!

delock commented Aug 6, 2025

Hi @Antlera, I have a question. I saw that ZenFlow runs the parameter update on the CPU. Does the DeepSpeed argument --bind_cores_to_rank help with CPU optimizer efficiency? If so, maybe this switch could be added to the launch script.

Here is a link describing this switch. It is intended for the CPU backend, but it should help with CPU offload as well.
https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/accelerator-setup-guide.md#how-to-launch-deepspeed-on-intel-architecture-cpu

Antlera commented Aug 7, 2025

Hi @delock. Thank you for bringing this up; this is a great observation! I just tested --bind_cores_to_rank on an Intel Xeon Gold 6330 (112 threads) with 2 × L4 GPUs and did see a tangible benefit of about 60–80 ms per step in the CPU-optimizer phase when I bound cores as 0-53,54-107. The key, however, is leaving enough cores free for background services; if too few are reserved (e.g., only one core left for background applications), the gains quickly flatten or even reverse.

ZenFlow currently sidesteps most contention by evenly sharding os.cpu_count() cores across ranks, but a smarter policy that explicitly reserves background cores, and perhaps factors in NUMA locality, could push things further. I can open an issue to track this as a potential enhancement, and I would welcome any best-practice core-binding patterns you've found effective.
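
As a concrete illustration, an explicit binding along those lines might look like the sketch below. The core ranges match the two-rank split mentioned above and would need adjusting for a different CPU/NUMA topology; --bind_core_list is, to my understanding, the companion launcher option for specifying explicit core ranges, so please treat this as an assumption rather than a tested command.

```bash
# Illustrative sketch: bind rank 0 to cores 0-53 and rank 1 to cores 54-107,
# leaving cores 108-111 free for background services. Adjust the ranges to
# your own topology; the training-script arguments are placeholders.
deepspeed --bind_cores_to_rank --bind_core_list 0-53,54-107 \
  finetune_llama.py --deepspeed_config zf_config.json
```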

I’ve shared more detailed discussion and logs in the email thread that Tunji forwarded—please let me know if you didn’t receive them.

delock commented Aug 8, 2025

Yes, please open an issue so we can discuss this in detail. And yes, leaving one core might not be enough if background services need more. We can also discuss how to make core-binding tuning easier to use.

sfc-gh-truwase added a commit to deepspeedai/DeepSpeed that referenced this pull request Aug 10, 2025
This PR adds a blog post and images for ZenFlow, introducing its design,
benefits, and usage. The blog explains how ZenFlow improves GPU
utilization by overlapping computation and communication during
offloaded training.

See also: 
#7391 – core ZenFlow implementation.
[#982](deepspeedai/DeepSpeedExamples#982) – benchmarking and fine-tuning example.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
@sfc-gh-truwase sfc-gh-truwase merged commit b4385e5 into deepspeedai:master Aug 16, 2025
2 checks passed
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025