[DP] Internal Load Balancing Per Node [one-pod-per-node] #21238

Merged (97 commits, Jul 24, 2025)

Conversation

robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Jul 20, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • add the capability to run one API server per node. This ensures local communication between the EngineCore and AsyncLLM (i.e., we do not have to go over the network)
  • this setup allows us to load balance across nodes externally, enabling a one-pod-per-node configuration for llm-d and avoiding the UCX issues we have with one-pod-per-rank balancing
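The external load balancing described above can be sketched as a trivial client-side balancer that rotates across the per-node API servers; each server then only dispatches to its own local engines. This is a hypothetical illustration (endpoint names and helper are placeholders, not vLLM code):

```python
import itertools

# Hypothetical per-node API server endpoints in a one-pod-per-node deployment.
NODE_ENDPOINTS = ["http://node-a:8100", "http://node-b:8200"]

def make_balancer(endpoints):
    """Round-robin picker over the per-node API servers."""
    cycle = itertools.cycle(endpoints)
    return lambda: next(cycle)

pick = make_balancer(NODE_ENDPOINTS)
# Requests alternate between nodes; each node's API server then load
# balances only across the engines local to that node.
print([pick() for _ in range(4)])
```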

NOTE:

  • Prometheus metrics are broken for P/D. This PR is compatible with the fix ([DP] Fix Prometheus Logging #21257)
  • in this PR, we use the DP coordinator for the intra-node LB. We don't strictly need to (we could use something local to LB), but avoiding it would require more complex surgery in vLLM

Resolves #21261

FOLLOW UPS:

  • add the ability to run with N servers per node as well
  • consider unifying the --data-parallel-rank UX of the old external LB with this setup (cc @njhill)
  • consider updating the architecture so that the DPCoordinator only sends LB messages

Test Plan

MODEL := "Qwen/Qwen3-30B-A3B-FP8"

dp_a:
  VLLM_LOGGING_LEVEL=DEBUG chg run --gpus 2 --  vllm serve {{MODEL}} \
    --port 8100 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 0 \
    --data-parallel-rpc-port 1234 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b:
  VLLM_LOGGING_LEVEL=DEBUG chg run --gpus 2 -- vllm serve {{MODEL}} \
    --port 8200 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 1234 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests
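As a sanity check on the flags above, the engine ranks hosted by each launch follow directly from --data-parallel-start-rank and --data-parallel-size-local. A minimal sketch (hypothetical helper, not vLLM code):

```python
def local_ranks(start_rank: int, local_size: int, total_size: int) -> list[int]:
    """Engine ranks hosted on one node, given its start rank and local size."""
    assert start_rank + local_size <= total_size, "ranks exceed data-parallel size"
    return list(range(start_rank, start_rank + local_size))

# dp_a: --data-parallel-start-rank 0, --data-parallel-size-local 2, --data-parallel-size 4
print(local_ranks(0, 2, 4))  # [0, 1]
# dp_b: --data-parallel-start-rank 2
print(local_ranks(2, 2, 4))  # [2, 3]
```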

eval PORT CONCURRENT LIMIT:
  lm_eval --model local-completions --tasks gsm8k \
    --model_args model={{MODEL}},base_url=http://127.0.0.1:{{PORT}}/v1/completions,num_concurrent={{CONCURRENT}},num_retries=0,tokenized_requests=False \
    --limit {{LIMIT}}
  • launch

just dp_a
just dp_b

  • run concurrently

just eval 8100 100 1000
just eval 8200 100 1000

Test Result

local-completions (model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:8100/v1/completions,num_concurrent=10,num_retries=0,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|| 0.90|±  |0.0302|
|     |       |strict-match    |     5|exact_match|| 0.96|±  |0.0197|

NOTE:

Also confirmed the old modes work:

dp_a_internal_lb PORT:
  chg run --gpus 2 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-rpc-port 1235 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_internal_lb:
  chg run --gpus 2 -- vllm serve {{MODEL}} \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 1235 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_a_external_lb PORT:
   chg run --gpus 1 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-rpc-port 1236 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_external_lb PORT:
  chg run --gpus 1 -- vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 1 \
    --data-parallel-rpc-port 1236 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

(Optional) Documentation Update

Robert Shaw added 2 commits July 19, 2025 16:27
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Robert Shaw <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces changes to support internal load balancing for data-parallel setups, specifically in a "one-pod-per-node" configuration. The changes involve modifications to the engine argument parsing, distributed setup, and communication logic. The most critical issues are the presence of a "HACK" that hardcodes a key configuration variable and several instances of commented-out code and debugging artifacts (e.g., print and logger.info statements). These should be removed or replaced with proper, configurable implementations to ensure the code is clean, maintainable, and production-ready. Additionally, a todo comment indicates that some parts of the code may be incomplete or require further updates. Please address these points to improve the quality and clarity of the codebase.

Comment on lines 48 to 52
# if args.data_parallel_start_rank:
#     raise ValueError(
#         "data_parallel_start_rank is only applicable "
#         "in headless mode. "
#         "Add --headless flag to enable headless mode.")
Contributor


high

This validation logic has been commented out. If this check is no longer required, the commented code should be removed. If the check is still necessary, it should be re-enabled.

Suggested change
# if args.data_parallel_start_rank:
#     raise ValueError(
#         "data_parallel_start_rank is only applicable "
#         "in headless mode. "
#         "Add --headless flag to enable headless mode.")
if args.data_parallel_start_rank:
    raise ValueError(
        "data_parallel_start_rank is only applicable "
        "in headless mode. "
        "Add --headless flag to enable headless mode.")

Robert Shaw added 15 commits July 20, 2025 02:19
Signed-off-by: Robert Shaw <[email protected]>
@njhill
Member

njhill commented Jul 20, 2025

@robertgshaw2-redhat I can spend some time on this tomorrow (Monday) if not before. This would be a third DP mode which is kind of a hybrid of the two existing ones. It will need some change to the coordinator and/or client load-balancing logic to constrain the set of engines considered to those associated with each API server.
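One way to read that constraint: the load-balancing logic would pick the least-loaded engine only among those attached to a given API server, rather than across all engines cluster-wide. A hypothetical sketch (not the actual vLLM coordinator code):

```python
def pick_engine(request_counts: dict[int, int], local_engines: set[int]) -> int:
    """Pick the least-loaded engine, restricted to the engines associated
    with this API server. Hypothetical illustration of the constraint."""
    candidates = {e: n for e, n in request_counts.items() if e in local_engines}
    if not candidates:
        raise ValueError("no local engines registered")
    return min(candidates, key=candidates.get)

# 4 DP ranks cluster-wide; this API server owns ranks {2, 3}.
loads = {0: 1, 1: 0, 2: 5, 3: 2}
print(pick_engine(loads, {2, 3}))  # 3, even though rank 1 is globally idlest
```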

Robert Shaw added 4 commits July 20, 2025 13:34
Signed-off-by: Nick Hill <[email protected]>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 22, 2025
@mergify mergify bot added the ci/build label Jul 23, 2025
njhill added 2 commits July 23, 2025 14:55
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

mergify bot commented Jul 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 23, 2025
# Conflicts:
#	.buildkite/test-pipeline.yaml
@mergify mergify bot removed the needs-rebase label Jul 23, 2025
@njhill
Member

njhill commented Jul 23, 2025

The test failure is just due to too strict tolerance for the balancing. We can wait for the remaining tests to finish and I can then push a change to relax the tolerance.

    # Use full external lb if we have local_size of 1.
    self.data_parallel_hybrid_lb = False
elif self.data_parallel_size_local is not None and (
        self.data_parallel_size_local != self.data_parallel_size):
Collaborator


This condition (self.data_parallel_size_local != self.data_parallel_size) makes it so that you can't set --data-parallel-hybrid-lb on a single node, which is really annoying since then you have to have different command line args for the single and multinode cases

Signed-off-by: Tyler Michael Smith <[email protected]>
@simon-mo simon-mo merged commit d5b981f into vllm-project:main Jul 24, 2025
96 of 97 checks passed
DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 28, 2025
…ect#21238)

Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: 董巍 <[email protected]>
Similar cherry-picks referencing this pull request: avigny (Jul 31), wenscarl (Aug 4), x22x22 (Aug 5), Pradyun92 (Aug 6), npanpaliya/odh-on-pz (Aug 6), jinzhen-lin (Aug 9), paulpak58 (Aug 13).
Labels
ci/build frontend ready ONLY add when PR is ready to merge/full CI is needed v1
Development

Successfully merging this pull request may close these issues.

[Feature]: Support One Pod Per Node LB for DP/EP
4 participants