[DP] Internal Load Balancing Per Node [one-pod-per-node] #21238

Merged (97 commits, Jul 24, 2025)

Conversation

robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Jul 20, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • add the capability to run one API server per node. This ensures local communication between the EngineCore and AsyncLLM (i.e., we do not have to go over the network)
  • this setup allows us to load balance across nodes externally, enabling a one-pod-per-node configuration for llm-d and avoiding the UCX issues we have with one-pod-per-rank balancing
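The external load balancing described above can be sketched as a trivial client-side balancer that rotates across the per-node API servers; each server then only dispatches to its own local engines. This is a hypothetical illustration (endpoint names and helper are placeholders, not vLLM code):

```python
import itertools

# Hypothetical per-node API server endpoints in a one-pod-per-node deployment.
NODE_ENDPOINTS = ["http://node-a:8100", "http://node-b:8200"]

def make_balancer(endpoints):
    """Round-robin picker over the per-node API servers."""
    cycle = itertools.cycle(endpoints)
    return lambda: next(cycle)

pick = make_balancer(NODE_ENDPOINTS)
# Requests alternate between nodes; each node's API server then load
# balances only across the engines local to that node.
print([pick() for _ in range(4)])
```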

NOTE:

  • Prometheus metrics are broken for P/D. This PR is compatible with the fix ([DP] Fix Prometheus Logging #21257)
  • in this PR, we use the DP coordinator for the intra-node LB. We don't strictly need to (we could use something local to LB), but avoiding it would require more complex surgery in vLLM

Resolves #21261

FOLLOW UPS:

  • add the ability to run with N servers per node as well
  • consider unifying the --data-parallel-rank UX of the old external LB with this setup (cc @njhill)
  • consider updating the architecture so that the DPCoordinator only sends LB messages

Test Plan

MODEL := "Qwen/Qwen3-30B-A3B-FP8"

dp_a:
  VLLM_LOGGING_LEVEL=DEBUG chg run --gpus 2 --  vllm serve {{MODEL}} \
    --port 8100 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 0 \
    --data-parallel-rpc-port 1234 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b:
  VLLM_LOGGING_LEVEL=DEBUG chg run --gpus 2 -- vllm serve {{MODEL}} \
    --port 8200 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 1234 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests
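As a sanity check on the flags above, the engine ranks hosted by each launch follow directly from --data-parallel-start-rank and --data-parallel-size-local. A minimal sketch (hypothetical helper, not vLLM code):

```python
def local_ranks(start_rank: int, local_size: int, total_size: int) -> list[int]:
    """Engine ranks hosted on one node, given its start rank and local size."""
    assert start_rank + local_size <= total_size, "ranks exceed data-parallel size"
    return list(range(start_rank, start_rank + local_size))

# dp_a: --data-parallel-start-rank 0, --data-parallel-size-local 2, --data-parallel-size 4
print(local_ranks(0, 2, 4))  # [0, 1]
# dp_b: --data-parallel-start-rank 2
print(local_ranks(2, 2, 4))  # [2, 3]
```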

eval PORT CONCURRENT LIMIT:
  lm_eval --model local-completions --tasks gsm8k \
    --model_args model={{MODEL}},base_url=http://127.0.0.1:{{PORT}}/v1/completions,num_concurrent={{CONCURRENT}},num_retries=0,tokenized_requests=False \
    --limit {{LIMIT}}
  • launch

just dp_a
just dp_b

  • run concurrently

just eval 8100 100 1000
just eval 8200 100 1000

Test Result

local-completions (model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:8100/v1/completions,num_concurrent=10,num_retries=0,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|| 0.90|±  |0.0302|
|     |       |strict-match    |     5|exact_match|| 0.96|±  |0.0197|

NOTE:

Also confirmed the old modes work:

dp_a_internal_lb PORT:
  chg run --gpus 2 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-rpc-port 1235 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_internal_lb:
  chg run --gpus 2 -- vllm serve {{MODEL}} \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 1235 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_a_external_lb PORT:
   chg run --gpus 1 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-rpc-port 1236 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_external_lb PORT:
  chg run --gpus 1 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 1 \
    --data-parallel-rpc-port 1236 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

(Optional) Documentation Update

Robert Shaw added 2 commits July 19, 2025 16:27
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Robert Shaw <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces changes to support internal load balancing for data-parallel setups, specifically in a "one-pod-per-node" configuration. The changes involve modifications to the engine argument parsing, distributed setup, and communication logic. The most critical issues are the presence of a "HACK" that hardcodes a key configuration variable and several instances of commented-out code and debugging artifacts (e.g., print and logger.info statements). These should be removed or replaced with proper, configurable implementations to ensure the code is clean, maintainable, and production-ready. Additionally, a todo comment indicates that some parts of the code may be incomplete or require further updates. Please address these points to improve the quality and clarity of the codebase.

Comment on lines 48 to 52
# if args.data_parallel_start_rank:
#     raise ValueError(
#         "data_parallel_start_rank is only applicable "
#         "in headless mode. "
#         "Add --headless flag to enable headless mode.")
Contributor


high

This validation logic has been commented out. If this check is no longer required, the commented code should be removed. If the check is still necessary, it should be re-enabled.

Suggested change
# if args.data_parallel_start_rank:
#     raise ValueError(
#         "data_parallel_start_rank is only applicable "
#         "in headless mode. "
#         "Add --headless flag to enable headless mode.")
if args.data_parallel_start_rank:
    raise ValueError(
        "data_parallel_start_rank is only applicable "
        "in headless mode. "
        "Add --headless flag to enable headless mode.")

Robert Shaw added 15 commits July 20, 2025 02:19
Signed-off-by: Robert Shaw <[email protected]>
@njhill
Member

njhill commented Jul 20, 2025

@robertgshaw2-redhat I can spend some time on this tomorrow (Monday) if not before. This would be a third DP mode which is kind of a hybrid of the two existing ones. It will need some change to the coordinator and/or client load-balancing logic to constrain the set of engines considered to those associated with each API server.
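One way to read that constraint: the load-balancing logic would pick the least-loaded engine only among those attached to a given API server, rather than across all engines cluster-wide. A hypothetical sketch (not the actual vLLM coordinator code):

```python
def pick_engine(request_counts: dict[int, int], local_engines: set[int]) -> int:
    """Pick the least-loaded engine, restricted to the engines associated
    with this API server. Hypothetical illustration of the constraint."""
    candidates = {e: n for e, n in request_counts.items() if e in local_engines}
    if not candidates:
        raise ValueError("no local engines registered")
    return min(candidates, key=candidates.get)

# 4 DP ranks cluster-wide; this API server owns ranks {2, 3}.
loads = {0: 1, 1: 0, 2: 5, 3: 2}
print(pick_engine(loads, {2, 3}))  # 3, even though rank 1 is globally idlest
```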

Robert Shaw added 4 commits July 20, 2025 13:34
Signed-off-by: Nick Hill <[email protected]>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 22, 2025
@mergify mergify bot added the ci/build label Jul 23, 2025
njhill added 2 commits July 23, 2025 14:55
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

mergify bot commented Jul 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 23, 2025
# Conflicts:
#	.buildkite/test-pipeline.yaml
@mergify mergify bot removed the needs-rebase label Jul 23, 2025
@njhill
Member

njhill commented Jul 23, 2025

The test failure is just due to too strict tolerance for the balancing. We can wait for the remaining tests to finish and I can then push a change to relax the tolerance.

    # Use full external lb if we have local_size of 1.
    self.data_parallel_hybrid_lb = False
elif self.data_parallel_size_local is not None and (
        self.data_parallel_size_local != self.data_parallel_size):
Collaborator


This condition (self.data_parallel_size_local != self.data_parallel_size) makes it so that you can't set --data-parallel-hybrid-lb on a single node, which is really annoying since then you have to have different command line args for the single and multinode cases

Signed-off-by: Tyler Michael Smith <[email protected]>
@simon-mo simon-mo merged commit d5b981f into vllm-project:main Jul 24, 2025
96 of 97 checks passed
DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 28, 2025
…ect#21238)

Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: 董巍 <[email protected]>
Similar cherry-picks referencing this pull request: avigny (Jul 31), wenscarl (Aug 4), x22x22 (Aug 5), Pradyun92 (Aug 6), npanpaliya/odh-on-pz (Aug 6), jinzhen-lin (Aug 9), paulpak58 (Aug 13).
Labels
ci/build frontend ready ONLY add when PR is ready to merge/full CI is needed v1
Development

Successfully merging this pull request may close these issues.

[Feature]: Support One Pod Per Node LB for DP/EP
4 participants