-
-
Notifications
You must be signed in to change notification settings - Fork 9.4k
[DP] Internal Load Balancing Per Node [one-pod-per-node
]
#21238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DP] Internal Load Balancing Per Node [one-pod-per-node
]
#21238
Conversation
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Signed-off-by: Robert Shaw <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces changes to support internal load balancing for data-parallel setups, specifically in a "one-pod-per-node" configuration. The changes involve modifications to the engine argument parsing, distributed setup, and communication logic. The most critical issues are the presence of a "HACK" that hardcodes a key configuration variable and several instances of commented-out code and debugging artifacts (e.g., print
and logger.info
statements). These should be removed or replaced with proper, configurable implementations to ensure the code is clean, maintainable, and production-ready. Additionally, a todo
comment indicates that some parts of the code may be incomplete or require further updates. Please address these points to improve the quality and clarity of the codebase.
vllm/entrypoints/cli/serve.py
Outdated
# if args.data_parallel_start_rank: | ||
# raise ValueError( | ||
# "data_parallel_start_rank is only applicable " | ||
# "in headless mode. " | ||
# "Add --headless flag to enable headless mode.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation logic has been commented out. If this check is no longer required, the commented code should be removed. If the check is still necessary, it should be re-enabled.
# if args.data_parallel_start_rank: | |
# raise ValueError( | |
# "data_parallel_start_rank is only applicable " | |
# "in headless mode. " | |
# "Add --headless flag to enable headless mode.") | |
if args.data_parallel_start_rank: | |
raise ValueError( | |
"data_parallel_start_rank is only applicable " | |
"in headless mode. " | |
"Add --headless flag to enable headless mode.") |
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
@robertgshaw2-redhat I can spend some time on this tomorrow (Monday) if not before. This would be a third DP mode which is kind of a hybrid of the two existing ones. It will need some change to the coordinator and/or client load-balancing logic to constrain the set of engines considered to those associated with each API server. |
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be |
# Conflicts: # .buildkite/test-pipeline.yaml
The test failure is just due to too strict tolerance for the balancing. We can wait for the remaining tests to finish and I can then push a change to relax the tolerance. |
vllm/engine/arg_utils.py
Outdated
# Use full external lb if we have local_size of 1. | ||
self.data_parallel_hybrid_lb = False | ||
elif self.data_parallel_size_local is not None and ( | ||
self.data_parallel_size_local != self.data_parallel_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition (self.data_parallel_size_local != self.data_parallel_size
) makes it so that you can't set --data-parallel-hybrid-lb
on a single node, which is really annoying since then you have to have different command line args for the single and multinode cases
Signed-off-by: Tyler Michael Smith <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: 董巍 <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: avigny <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: shuw <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: x22x22 <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
…ect#21238) Signed-off-by: Robert Shaw <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.Purpose
EngineCore
andAsyncLLM
(i.e. do not have to go over the network), while allowing us tollm-d
to avoid the UCX issues we have withone-pod-per-rank
balancingNOTE:
Resolves #21261
FOLLOW UPS:
--data-parallel-rank
UX for the old external LB to a unified setup (cc @njhill)DPCoordinator
to only send LB messagesTest Plan
-run concurrently
just eval 8100 100 1000
just eval 8200 100 1000
Test Result
NOTE:
Also confirmed old ways works:
(Optional) Documentation Update