[perf, data] feat: DP workload balance #3605

conver334 · 2025-09-25T01:56:01Z

What does this PR do?

Mitigate workload imbalance across data-parallel (DP) ranks.

As shown in the figure below, all DP ranks must synchronize after each mini-batch, so stragglers that process longer sequences delay every other worker.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: [model] feat: polish megatron engine #3401
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

In the figure below, the curves whose names end in Balance achieve higher MFU in Qwen2.5-Math-7 GRPO training.

API and Usage Example

Split a DataProto into n workload-balanced chunks:

_balance_data_proto(DataProto_obj, chunks)

Design & Code Changes

In the figure, the leftmost panel shows the unsplit data with a global batch size of 16.

When DP = 2, the existing method splits the batch sequentially across the two ranks. In this case, rank 0 receives more tokens than rank 1.
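A minimal sketch of the sequential split described above. The sequence lengths and the helper name `sequential_split` are hypothetical, chosen only to illustrate how contiguous chunking can leave one rank with far more tokens than the other.

```python
# Hypothetical sequence lengths for a global batch of 16 samples.
seqlens = [4096, 3800, 3500, 3100, 2900, 2500, 2200, 2000,
           900, 800, 700, 600, 500, 400, 300, 200]


def sequential_split(items, num_ranks):
    """Split items into contiguous, equally sized chunks, one per rank."""
    chunk = len(items) // num_ranks
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]


# Total tokens assigned to each rank when DP = 2.
rank_tokens = [sum(part) for part in sequential_split(seqlens, 2)]
print(rank_tokens)
```

With these made-up lengths, rank 0 holds all the long sequences, so rank 1 would sit idle at the synchronization point until rank 0 finishes.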

The rightmost panel shows our design. We model the workload of each data entry and use the Karmarkar-Karp algorithm to split the batch into two equal parts whose total workloads are as close as possible.
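The partitioning idea can be sketched with a generic Karmarkar-Karp (largest differencing) routine. This is an illustrative sketch, not the code in this PR: the function name `karmarkar_karp` and the example workloads are assumptions, and this simplified version does not force every part to contain the same number of samples.

```python
import heapq
from itertools import count


def karmarkar_karp(workloads, k):
    """Partition item indices into k groups with near-equal total workload.

    Assumes a non-empty list of workloads. Each heap entry is a partial
    solution: k (sum, indices) pairs sorted by descending sum, keyed by
    its spread (largest sum minus smallest sum).
    """
    tie = count()  # unique tie-breaker so the heap never compares lists
    heap = []
    for idx, w in enumerate(workloads):
        parts = [(w, [idx])] + [(0, []) for _ in range(k - 1)]
        spread = parts[0][0] - parts[-1][0]
        heapq.heappush(heap, (-spread, next(tie), parts))

    while len(heap) > 1:
        # Take the two partial solutions with the largest spread.
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the heaviest group of one with the lightest group of the other.
        merged = [(sa + sb, ia + ib)
                  for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda p: p[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))

    return [sorted(indices) for _, indices in heap[0][2]]


# Hypothetical per-sample workloads, split across DP = 2 ranks.
workloads = [8, 7, 6, 5, 4]
parts = karmarkar_karp(workloads, 2)
part_sums = [sum(workloads[i] for i in p) for p in parts]
print(parts, part_sums)
```

Karmarkar-Karp is a heuristic, so the result is close to balanced but not guaranteed optimal: here the two parts sum to 16 and 14, while a perfect 15/15 split exists.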

The workload can be calculated using the FLOPs formula in verl. For now, we roughly estimate the FLOPs with the hardcoded expression seqlens**2 + seqlens * 24576 (attention + MLP of a 7B model).
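The estimate can be written as a small helper. This is a sketch only: the name `estimate_workloads` and the `linear_coeff` parameter are hypothetical. The coefficient is left as an argument because the description quotes 24576 while the reviewed diff uses 33024.

```python
def estimate_workloads(seqlens, linear_coeff=24576):
    """Rough per-sample cost: a quadratic attention term plus a linear MLP term."""
    return [s * s + s * linear_coeff for s in seqlens]


# Hypothetical sequence lengths.
workloads = estimate_workloads([512, 4096])
print(workloads)
```

At these lengths the linear term dominates, so the estimate grows roughly in proportion to sequence length until sequences become comparable in size to the coefficient.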

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

CLAassistant · 2025-09-25T01:56:08Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request introduces workload balancing for Data Parallelism to mitigate the impact of stragglers with longer sequences. The core change is the _balance_data_proto function, which reorders data within a batch using the Karmarkar-Karp algorithm to equalize workload across DP ranks before splitting. My review has identified a critical bug that breaks data splitting when auto-padding is enabled, and a high-severity issue related to a hardcoded value that limits the feature's general applicability. Addressing these points will improve the correctness and maintainability of the implementation.

verl/single_controller/base/decorator.py

gemini-code-assist · 2025-09-25T01:57:15Z

+        # approximate workload of transformer block
+        workloads = seqlens**2 + seqlens * 33024


The workload calculation uses a hardcoded magic number 33024. According to the PR description, this is specific to a 7B model's MLP layer. This limits the general applicability of this workload balancing feature to other model architectures.

The function signature includes an unused model_config parameter, which should be used to pass model-specific configuration, such as the MLP size, to make the workload calculation more flexible and accurate for different models.

Consider refactoring this to use the model_config parameter. For example:

# approximate workload of transformer block
mlp_factor = model_config.get("mlp_workload_factor", 33024) if model_config else 33024
workloads = seqlens**2 + seqlens * mlp_factor

conver334 requested review from zw0610 and wuxibin89 as code owners September 25, 2025 01:56

gemini-code-assist bot reviewed Sep 25, 2025

ISEEKYAN mentioned this pull request Sep 25, 2025

[BUG]The time consumption of these two functions, backward_step and get_grad_norm, has increased NVIDIA/Megatron-LM#1691

Open

conver334 added 2 commits September 26, 2025 01:16

DP workload balance

ea448e9

move dp balance to single_controller, add workload balance in dynamic…

7432a21

… batching, adjust micro batch order to reduce PP bubble

conver334 force-pushed the DP_workload_balance branch from 5e9feec to 7432a21 Compare September 26, 2025 08:30

conver334 requested review from eric-haibin-lin, vermouth1992, tongyx361 and PeterSH6 as code owners September 26, 2025 08:30

add comments

090a58b

conver334 force-pushed the DP_workload_balance branch from 880688e to 090a58b Compare September 28, 2025 06:21
