[RFC]: Elastic Expert Parallelism #20323

@ruisearch42

Description

Motivation.

Expert parallelism (EP) is a key technique for enabling high-throughput, low-latency, large-scale LLM serving of Mixture-of-Experts models such as DeepSeek-V3/R1. However, EP today is static in vLLM, as in many other inference frameworks: when incoming requests exceed the current serving capacity, vLLM cannot scale up to meet the demand, and when there are fewer requests, vLLM cannot scale down to reduce GPU usage and cost. The only viable option today is a complete restart with a new configuration, which is slow and drops a lot of traffic.

In this RFC, we propose Elastic Expert Parallelism to address the above challenges. With Elastic EP, vLLM will be able to dynamically scale up or down based on workload fluctuations, with minimal interruption to serving.

Proposed Change.

Background: Expert parallelism is closely related to Data Parallel (DP) Attention; please refer to [RFC]: Data Parallel Attention and Expert Parallel MoEs for details.

At a high level, we propose to add the following functionality to support Elastic EP:

  • Bring up new DP engine-cores (when scaling up), or tear down a subset of existing DP engine-cores (when scaling down)
  • Update the states of retained engine-cores

The states include (a minimal reinitialization sketch follows the list):

  • Distributed environment & communicators
    • Communicators include the engine-core's DP communicator, as well as the workers' model-parallel communicators (TP, PP, DP, EP)
  • Model structure & weights
    • Rebalance expert loads upon scale up/down using Expert Parallelism Load Balancer (EPLB) algorithms
  • CUDA graphs and torch.compile caches
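
To make the state update concrete, here is a minimal per-rank sketch using plain torch.distributed; the function name, argument handling, and group layout are illustrative assumptions, not vLLM's actual internals:

```python
import os
import torch.distributed as dist

def reinit_engine_core_state(rank: int, new_world_size: int,
                             master_addr: str, master_port: int) -> None:
    """Hypothetical per-rank state reinitialization on a scale event."""
    # 1. Tear down the old communicators so no stale NCCL state survives.
    if dist.is_initialized():
        dist.destroy_process_group()

    # 2. Re-create the global process group with the new world size.
    #    Retained ranks keep their ids; new engine-cores join with fresh
    #    ranks assigned by the coordinator.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend="nccl", rank=rank,
                            world_size=new_world_size)

    # 3. Rebuild sub-groups on the new world; only an EP group spanning
    #    all ranks is shown here (TP/PP groups are analogous).
    ep_group = dist.new_group(ranks=list(range(new_world_size)))
    _ = ep_group  # placeholder: the group would be stored on worker state

    # 4. Expert weights are then redistributed via EPLB, and CUDA graphs /
    #    torch.compile caches are refreshed (not shown).
```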

The overall flow is depicted in the following diagram. It shows upscaling from DP=4 to DP=8 as an example.

Main components:

  • Engine-core and worker states management
    • Destruction & reinitialization of the states
  • Upscale/downscale scheduling
    • We will leverage the Ray EngineCoreActorManager as the coordinator, but the design can be extended to other DP engine-core managers
    • Upscale: coordinate startup of new engine-cores and reinitialization of existing engine-cores
    • Downscale: coordinate shutdown of old engine-cores and reinitialization of retained engine-cores
  • New API server endpoint to trigger upscale/downscale (a sketch follows this list)
    • Wire it through the control plane
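
As an illustration of the control-plane entry point, below is a hedged sketch of what the endpoint could look like; the route name (/scale), payload field, and manager interface are hypothetical placeholders, not the finalized API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScaleRequest(BaseModel):
    new_data_parallel_size: int  # target DP size, e.g. 4 -> 8

class EngineCoreManagerStub:
    """Stand-in for the Ray EngineCoreActorManager coordinator."""
    async def scale_to(self, dp_size: int) -> None:
        # Coordinate engine-core startup/shutdown and state
        # reinitialization across retained and new engine-cores.
        ...

manager = EngineCoreManagerStub()

@app.post("/scale")  # hypothetical route name
async def scale(req: ScaleRequest):
    # The endpoint only forwards the request down the control plane;
    # the coordinator performs the actual scaling work.
    await manager.scale_to(req.new_data_parallel_size)
    return {"status": "ok", "dp_size": req.new_data_parallel_size}
```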

Extensions:

  • Fault-tolerance: replacement of faulty engine-cores
  • Autoscaling: auto trigger of upscale/downscale
  • Integrate with prefill disaggregation

Milestones

Milestone 1: Basic Functionality

Support EP scale up & scale down for a commonly used EP configuration (e.g., the pplx kernels from Perplexity):

  • Retained engine-core state destruction & reinitialization
    • Distributed environment
    • Distributed communicators
    • Model structure & weights: including EPLB
    • CUDA graph & torch.compile caches
  • New engine-core startup
    • KV cache initialization: use available GPU memory information from an existing engine-core to skip expensive profiling (see the sizing sketch after this list)
  • Unneeded engine-core shutdown
  • Control plane
    • API server endpoint
    • DP engine-core scheduling: e.g., collective operations (from retained and new engine-cores) must execute at the same time
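
For the KV cache initialization point above, here is a rough sketch of how a new engine-core could size its cache directly from the memory budget reported by a retained engine-core; the function name and the model-shape numbers in the example are illustrative assumptions:

```python
def num_kv_cache_blocks(available_gpu_memory_bytes: int,
                        block_size_tokens: int,
                        num_layers: int,
                        num_kv_heads: int,
                        head_dim: int,
                        dtype_bytes: int = 2) -> int:
    """Size the KV cache from a reported memory budget, skipping the
    usual profiling forward pass."""
    # One block stores keys and values for every layer.
    bytes_per_block = (2 * num_layers * block_size_tokens
                       * num_kv_heads * head_dim * dtype_bytes)
    return available_gpu_memory_bytes // bytes_per_block

# Example: a retained engine-core reports a 40 GiB KV cache budget.
blocks = num_kv_cache_blocks(available_gpu_memory_bytes=40 * 2**30,
                             block_size_tokens=16, num_layers=61,
                             num_kv_heads=128, head_dim=128)
```

The point is that the memory budget is transferred over the control plane rather than re-measured, so the new engine-core can allocate its KV cache immediately.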

Milestone 2: Optimizations

Optimize the upscale and downscale workflows to minimize service interruption:

  • Overlap new communicator construction with old communicator destruction (sketched after this list)
  • Fast model reinitialization & weight loading via EPLB rebalancing
  • CUDA graph creation optimization: e.g., lazy/async initialization, in-place update
  • torch.compile cache management optimizations: e.g., patch and reuse
  • Improved sequencing and coordination of the above steps
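
For the communicator overlap item, a minimal sketch of running teardown and construction concurrently; it assumes the two phases touch disjoint resources and that the backend tolerates concurrent setup/teardown, and the stub functions are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def destroy_old_communicators() -> None:
    # e.g., dist.destroy_process_group(old_group)
    ...

def build_new_communicators() -> None:
    # e.g., dist.init_process_group(...) / dist.new_group(new_ranks)
    ...

# Overlap teardown of the old communicators with construction of the
# new ones, instead of running the two phases back to back.
with ThreadPoolExecutor(max_workers=2) as pool:
    teardown = pool.submit(destroy_old_communicators)
    build = pool.submit(build_new_communicators)
    teardown.result()
    build.result()
```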

Milestone 3: Generally Available

Make elastic EP generally available:

  • Support for multi-node and TP>1
  • Support for commonly used kernels & combinations: DeepEP, DeepGemm, etc.
  • Support both internal and external DP load balancing (DPLB)
  • Large scale serving: test with commonly used large scale setups

Milestone 4: Extensions

Build on top of elastic EP to support more comprehensive functionalities:

  • Fault-tolerance: replacement of faulty engine-cores, continuing to serve on the remaining nodes while waiting for replacements
  • Autoscaling: automatic triggering of upscale/downscale (an illustrative trigger loop is sketched after this list)
  • Autoscaling scheduler policy to decide the parallelism configuration, including DP/EP topology and expert layout, based on performance profiles and goals
  • Scheduling to realize the above policy
  • Integrate with prefill disaggregation
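
To illustrate the autoscaling trigger, here is a toy control loop; the thresholds, metrics callbacks, and scale_to interface are all hypothetical:

```python
import time

def autoscale_loop(get_queue_depth, get_gpu_util, scale_to,
                   current_dp: int, min_dp: int = 1, max_dp: int = 8,
                   interval_s: float = 30.0) -> None:
    """Toy policy: scale up when the request queue is saturated,
    scale down when GPU utilization stays low."""
    while True:
        if get_queue_depth() > 100 and current_dp < max_dp:
            current_dp *= 2          # upscale, e.g. DP=4 -> DP=8
            scale_to(current_dp)
        elif get_gpu_util() < 0.3 and current_dp > min_dp:
            current_dp //= 2         # downscale to cut GPU cost
            scale_to(current_dp)
        time.sleep(interval_s)
```

A production policy would add hysteresis and cooldown windows so the system does not thrash between configurations.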

Feedback Period.

No response

CC List.

@libertyeagle @simon-mo @tlrmchlsmth @njhill @kouroshHakha

Any Other Things.

No response
