[RFC]: Elastic Expert Parallelism #20323

@ruisearch42

Description

Motivation.

Expert parallelism (EP) is a key technique for enabling high-throughput, low-latency, large-scale LLM serving of Mixture-of-Experts models such as DeepSeek-V3/R1. However, EP today is static in vLLM, as in many other inference frameworks: when incoming requests exceed the current serving capacity, vLLM cannot scale up to meet the demand, and when there are fewer requests, vLLM cannot scale down to reduce GPU usage and cost. The only viable option today is a complete restart with a new configuration, which is slow and drops a lot of traffic.

In this RFC, we propose Elastic Expert Parallelism to address the above challenges. With Elastic EP, vLLM will be able to dynamically scale up or down based on workload fluctuations, with minimal interruption to serving.

Proposed Change.

Background: Expert parallelism is closely related to Data Parallel (DP) Attention; please refer to [RFC]: Data Parallel Attention and Expert Parallel MoEs for details.

At a high level, we propose to add the following functionality to support Elastic EP:

  • Bring up new DP engine-cores (when scaling up), or tear down a subset of existing DP engine-cores (when scaling down)
  • Update the states of retained engine-cores

The states include (a minimal reinitialization sketch follows the list):

  • Distributed environment & communicators
    • Communicators include the engine-core's DP communicator, as well as the workers' model-parallel communicators (TP, PP, DP, EP)
  • Model structure & weights
    • Rebalance expert loads upon scale up/down using Expert Parallelism Load Balancer (EPLB) algorithms
  • CUDA graphs and torch.compile caches
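
To make the state update concrete, here is a minimal per-rank sketch using plain torch.distributed; the function name, argument handling, and group layout are illustrative assumptions, not vLLM's actual internals:

```python
import os
import torch.distributed as dist

def reinit_engine_core_state(rank: int, new_world_size: int,
                             master_addr: str, master_port: int) -> None:
    """Hypothetical per-rank state reinitialization on a scale event."""
    # 1. Tear down the old communicators so no stale NCCL state survives.
    if dist.is_initialized():
        dist.destroy_process_group()

    # 2. Re-create the global process group with the new world size.
    #    Retained ranks keep their ids; new engine-cores join with fresh
    #    ranks assigned by the coordinator.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend="nccl", rank=rank,
                            world_size=new_world_size)

    # 3. Rebuild sub-groups on the new world; only an EP group spanning
    #    all ranks is shown here (TP/PP groups are analogous).
    ep_group = dist.new_group(ranks=list(range(new_world_size)))
    _ = ep_group  # placeholder: the group would be stored on worker state

    # 4. Expert weights are then redistributed via EPLB, and CUDA graphs /
    #    torch.compile caches are refreshed (not shown).
```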

The overall flow is depicted in the following diagram. It shows upscaling from DP=4 to DP=8 as an example.

Main components:

  • Engine-core and worker states management
    • Destruction & reinitialization of the states
  • Upscale/downscale scheduling
    • We will leverage the Ray EngineCoreActorManager as the coordinator, but the design can be extended to other DP engine-core managers
    • Upscale: coordinate startup of new engine-cores and reinitialization of existing engine-cores
    • Downscale: coordinate shutdown of old engine-cores and reinitialization of retained engine-cores
  • New API server endpoint to trigger upscale/downscale (a sketch follows this list)
    • Wire it through the control plane
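
As an illustration of the control-plane entry point, below is a hedged sketch of what the endpoint could look like; the route name (/scale), payload field, and manager interface are hypothetical placeholders, not the finalized API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScaleRequest(BaseModel):
    new_data_parallel_size: int  # target DP size, e.g. 4 -> 8

class EngineCoreManagerStub:
    """Stand-in for the Ray EngineCoreActorManager coordinator."""
    async def scale_to(self, dp_size: int) -> None:
        # Coordinate engine-core startup/shutdown and state
        # reinitialization across retained and new engine-cores.
        ...

manager = EngineCoreManagerStub()

@app.post("/scale")  # hypothetical route name
async def scale(req: ScaleRequest):
    # The endpoint only forwards the request down the control plane;
    # the coordinator performs the actual scaling work.
    await manager.scale_to(req.new_data_parallel_size)
    return {"status": "ok", "dp_size": req.new_data_parallel_size}
```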

Extensions:

  • Fault-tolerance: replacement of faulty engine-cores
  • Autoscaling: auto trigger of upscale/downscale
  • Integrate with prefill disaggregation

Milestones

Milestone 1: Basic Functionality

Support EP scale up & scale down for a commonly used EP configuration (e.g., the pplx kernels from Perplexity):

  • Retained engine-core state destruction & reinitialization
    • Distributed environment
    • Distributed communicators
    • Model structure & weights: including EPLB
    • CUDA graph & torch.compile caches
  • New engine-core startup
    • KV cache initialization: use available GPU memory information from an existing engine-core to skip expensive profiling (see the sizing sketch after this list)
  • Unneeded engine-core shutdown
  • Control plane
    • API server endpoint
    • DP engine-core scheduling: e.g., collective operations (from retained and new engine-cores) must execute at the same time
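
For the KV cache initialization point above, here is a rough sketch of how a new engine-core could size its cache directly from the memory budget reported by a retained engine-core; the function name and the model-shape numbers in the example are illustrative assumptions:

```python
def num_kv_cache_blocks(available_gpu_memory_bytes: int,
                        block_size_tokens: int,
                        num_layers: int,
                        num_kv_heads: int,
                        head_dim: int,
                        dtype_bytes: int = 2) -> int:
    """Size the KV cache from a reported memory budget, skipping the
    usual profiling forward pass."""
    # One block stores keys and values for every layer.
    bytes_per_block = (2 * num_layers * block_size_tokens
                       * num_kv_heads * head_dim * dtype_bytes)
    return available_gpu_memory_bytes // bytes_per_block

# Example: a retained engine-core reports a 40 GiB KV cache budget.
blocks = num_kv_cache_blocks(available_gpu_memory_bytes=40 * 2**30,
                             block_size_tokens=16, num_layers=61,
                             num_kv_heads=128, head_dim=128)
```

The point is that the memory budget is transferred over the control plane rather than re-measured, so the new engine-core can allocate its KV cache immediately.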

Milestone 2: Optimizations

Optimize the upscale and downscale workflows to minimize service interruption:

  • Overlap new communicator construction with old communicator destruction (sketched after this list)
  • Fast model reinitialization & weight loading via EPLB rebalancing
  • CUDA graph creation optimization: e.g., lazy/async initialization, in-place update
  • torch.compile cache management optimizations: e.g., patch and reuse
  • Improved sequencing and coordination of the above steps
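
For the communicator overlap item, a minimal sketch of running teardown and construction concurrently; it assumes the two phases touch disjoint resources and that the backend tolerates concurrent setup/teardown, and the stub functions are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def destroy_old_communicators() -> None:
    # e.g., dist.destroy_process_group(old_group)
    ...

def build_new_communicators() -> None:
    # e.g., dist.init_process_group(...) / dist.new_group(new_ranks)
    ...

# Overlap teardown of the old communicators with construction of the
# new ones, instead of running the two phases back to back.
with ThreadPoolExecutor(max_workers=2) as pool:
    teardown = pool.submit(destroy_old_communicators)
    build = pool.submit(build_new_communicators)
    teardown.result()
    build.result()
```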

Milestone 3: Generally Available

Make elastic EP generally available:

  • Support for multi-node and TP>1
  • Support for commonly used kernels & combinations: DeepEP, DeepGemm, etc.
  • Support both internal and external DP load balancing (DPLB)
  • Large scale serving: test with commonly used large scale setups

Milestone 4: Extensions

Build on top of elastic EP to support more comprehensive functionalities:

  • Fault-tolerance: replacement of faulty engine-cores, continuing to serve on the remaining nodes while waiting for replacements
  • Autoscaling: automatic triggering of upscale/downscale (an illustrative trigger loop is sketched after this list)
  • Autoscaling scheduler policy to decide the parallelism configuration, including DP/EP topology and expert layout, based on performance profiles and goals
  • Scheduling to realize the above policy
  • Integrate with prefill disaggregation
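
To illustrate the autoscaling trigger, here is a toy control loop; the thresholds, metrics callbacks, and scale_to interface are all hypothetical:

```python
import time

def autoscale_loop(get_queue_depth, get_gpu_util, scale_to,
                   current_dp: int, min_dp: int = 1, max_dp: int = 8,
                   interval_s: float = 30.0) -> None:
    """Toy policy: scale up when the request queue is saturated,
    scale down when GPU utilization stays low."""
    while True:
        if get_queue_depth() > 100 and current_dp < max_dp:
            current_dp *= 2          # upscale, e.g. DP=4 -> DP=8
            scale_to(current_dp)
        elif get_gpu_util() < 0.3 and current_dp > min_dp:
            current_dp //= 2         # downscale to cut GPU cost
            scale_to(current_dp)
        time.sleep(interval_s)
```

A production policy would add hysteresis and cooldown windows so the system does not thrash between configurations.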

Feedback Period.

No response

CC List.

@libertyeagle @simon-mo @tlrmchlsmth @njhill @kouroshHakha

Any Other Things.

No response
