[P/D] NIXL Integration #17751
Merged: simon-mo merged 100 commits into vllm-project:main from robertgshaw2-redhat:upstream-nixl-clean on May 12, 2025 (+2,724 −109).
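This PR adds a NIXL-based KV connector for prefill/decode (P/D) disaggregation. As the test script below shows, a server opts in through vLLM's --kv-transfer-config flag plus a NIXL side-channel port used for the runtime handshake. A minimal single-GPU sketch assembled from the flags in that script; the model name, GPU index, and ports are purely illustrative:

CUDA_VISIBLE_DEVICES=0 VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve Qwen/Qwen3-0.6B \
  --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'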
Commits (100)

f1575de  [P/D Disagg] Direct NIXL Connector (#60) (tlrmchlsmth)
c90a2c8  fix failing tests (#64) (robertgshaw2-redhat)
91040e3  NIXL Runtime Handshake (#63) (robertgshaw2-redhat)
b1d83f9  [V1] Support multiple kv connectors (#61) (njhill)
a904820  [Disagg PD] Add test for xPyD (#66) (tlrmchlsmth)
8061a5c  [Test] Improve MultiConnector test (#69) (njhill)
06847be  [P/D Disagg] [1/N] Support Homogeneous TP > 1 (#65) (robertgshaw2-redhat)
8beac5e  [PD Disagg] Cruft / Minor Mem Leak (#71) (robertgshaw2-redhat)
2c50275  [P/D Disagg] NIXL MLA (#70) (tlrmchlsmth)
8ce4c07  [Bugfix] Fix env name for VLLM_NIXL_SIDE_CHANNEL_HOST (#73) (robertgshaw2-redhat)
bf0be1b  Merge branch 'main' into disagg_pd_dev (tlrmchlsmth)
3783696  Merge pull request #76 from neuralmagic/disagg_pd_dev_merge_main (tlrmchlsmth)
42b869e  updated (robertgshaw2-redhat)
bb2abeb  fixup testing (robertgshaw2-redhat)
b2de5e9  remove multi-connector (robertgshaw2-redhat)
391a94a  remove multi-connector (robertgshaw2-redhat)
42d6d26  remove multi-connector (robertgshaw2-redhat)
527fbf1  cleanup paths (robertgshaw2-redhat)
40fd5b0  cleanup paths (robertgshaw2-redhat)
79561f4  cleanup paths (robertgshaw2-redhat)
ccd356a  cleanup paths (robertgshaw2-redhat)
43adf8e  cleanup paths (robertgshaw2-redhat)
e98b512  cleanup paths (robertgshaw2-redhat)
84c1379  cleanup paths (robertgshaw2-redhat)
e3e1738  cleanup (robertgshaw2-redhat)
cd2484d  cleanup (robertgshaw2-redhat)
925de01  cleanup (robertgshaw2-redhat)
61de85e  cleanup (robertgshaw2-redhat)
0c80607  cleanup (robertgshaw2-redhat)
e1e03cb  cleanup (robertgshaw2-redhat)
49faaf1  cleanup (robertgshaw2-redhat)
b3ef87c  cleanup (robertgshaw2-redhat)
3625bc2  cleanup (robertgshaw2-redhat)
67b4e50  cleanup (robertgshaw2-redhat)
64bbff1  cleanup (robertgshaw2-redhat)
ef349ef  updated (robertgshaw2-redhat)
80a8c79  updated (robertgshaw2-redhat)
7d5444c  updated (robertgshaw2-redhat)
ff0572e  close but no cigar (robertgshaw2-redhat)
9e7ee72  correctness! (robertgshaw2-redhat)
5c3fc88  correctness! (robertgshaw2-redhat)
5fd2138  Generalize/streamline async loading (remote prefill) side (njhill)
bb58f7c  Merge pull request #9 from njhill/abstract-async-load (robertgshaw2-redhat)
e673bdd  Move new GPUModelRunner methods out of execute_model method (njhill)
be6879d  Merge remote-tracking branch 'upstream/main' into upstream-nixl-clean (robertgshaw2-redhat)
48add41  updated (robertgshaw2-redhat)
3050565  updated (robertgshaw2-redhat)
c2f2e77  Merge pull request #10 from njhill/streamline-runner (robertgshaw2-redhat)
2d44ff7  updated (robertgshaw2-redhat)
6ee57bd  updated (robertgshaw2-redhat)
dd0c0ca  updated (robertgshaw2-redhat)
a57ec2d  support prefill instance disabling prefix caching (robertgshaw2-redhat)
0d7f3c8  updated (robertgshaw2-redhat)
816c3fa  poc with local prefix caching seems to be working (robertgshaw2-redhat)
4add1a6  remove debug cruft (robertgshaw2-redhat)
1616b15  updated (robertgshaw2-redhat)
71a09f3  remove unnecessary send_kv_no_op list (njhill)
e4e650d  added more tests (robertgshaw2-redhat)
8433c0e  added more tests (robertgshaw2-redhat)
14e46ba  update comment (njhill)
a9ec035  Merge branch 'upstream-nixl-clean' of https://github.com/robertgshaw2… (robertgshaw2-redhat)
5bcd191  updated (robertgshaw2-redhat)
9812ecd  Abstract async saving (njhill)
70f3ed5  Merge pull request #11 from njhill/abstract-async-save (robertgshaw2-redhat)
1ce8fd9  fix nits (robertgshaw2-redhat)
c5dc489  remove spurious change (robertgshaw2-redhat)
5a4150e  Arrange methods in base.py in correct worker/scheduler groupings (njhill)
5389ef5  cleanup tests (robertgshaw2-redhat)
e8d961d  cleanup tests (robertgshaw2-redhat)
7514ce0  update comment (robertgshaw2-redhat)
4abf0c5  cleanup (robertgshaw2-redhat)
4a50103  cleanup (robertgshaw2-redhat)
51ed361  cleanup (robertgshaw2-redhat)
9b4433b  cleanup (robertgshaw2-redhat)
ccbc114  Merge branch 'main' into upstream-nixl-clean (robertgshaw2-redhat)
28e9a7c  updated (robertgshaw2-redhat)
63f17f6  updated (robertgshaw2-redhat)
f60e650  updated (robertgshaw2-redhat)
f3fdfe8  reduce spurious changes (robertgshaw2-redhat)
287b334  updated (robertgshaw2-redhat)
d35b116  updated (robertgshaw2-redhat)
5ed8c2f  updated (robertgshaw2-redhat)
6baecd7  updated (robertgshaw2-redhat)
f784c19  fix mock issue (robertgshaw2-redhat)
7a27ffe  pass generate-finished request ids to worker connector (njhill)
f0ae12f  stahs (robertgshaw2-redhat)
ade6c29  fix test (robertgshaw2-redhat)
bd02803  Merge branch 'upstream-nixl-clean' of https://github.com/robertgshaw2… (robertgshaw2-redhat)
73cea2d  fix typing (robertgshaw2-redhat)
4d66549  Merge branch 'main' into upstream-nixl-clean (robertgshaw2-redhat)
d6fdf57  Add kv_transfer_params to chat completion endpoints (nerdalert)
70c07de  Merge pull request #12 from nerdalert/add-chat-kvparams (robertgshaw2-redhat)
e7066cf  fix edge case (robertgshaw2-redhat)
241facb  fix edge case (robertgshaw2-redhat)
f665786  mypy (robertgshaw2-redhat)
626bb11  added edge case tests (robertgshaw2-redhat)
4f0a5e9  added edge case tests (robertgshaw2-redhat)
31bf313  added edge case tests (robertgshaw2-redhat)
56937cd  added edge case tests (robertgshaw2-redhat)
42b1ad0  updated (robertgshaw2-redhat)
tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh (172 additions, 0 deletions)
#!/bin/bash
set -xe

# Models to run
MODELS=(
  "Qwen/Qwen3-0.6B"
  "deepseek-ai/deepseek-vl2-tiny"
)

# Number of prefill and decode instances to create
NUM_PREFILL_INSTANCES=${NUM_PREFILL_INSTANCES:-1} # Default to 1
NUM_DECODE_INSTANCES=${NUM_DECODE_INSTANCES:-2} # Default to 2

# Find the git repository root directory
GIT_ROOT=$(git rev-parse --show-toplevel)

# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

# Waits for vLLM to start.
wait_for_server() {
  local port=$1
  timeout 1200 bash -c "
    until curl -s localhost:${port}/v1/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

# Function to clean up previous instances
cleanup_instances() {
  echo "Cleaning up any running vLLM instances..."
  pkill -f "vllm serve" || true
  sleep 2
}

# Helper to get model-specific arguments for deepseek
get_model_args() {
  local model_name=$1
  local extra_args=""

  if [[ "$model_name" == "deepseek-ai/deepseek-vl2-tiny" ]]; then
    extra_args="--hf_overrides '{\"architectures\": [\"DeepseekVLV2ForCausalLM\"]}' --trust-remote-code"
  fi

  echo "$extra_args"
}

# Function to run tests for a specific model
run_tests_for_model() {
  local model_name=$1
  echo "================================"
  echo "Testing model: $model_name"
  echo "================================"

  # Get model-specific arguments
  local model_args=$(get_model_args "$model_name")

  # Arrays to store all hosts and ports
  PREFILL_HOSTS=()
  PREFILL_PORTS=()
  DECODE_HOSTS=()
  DECODE_PORTS=()

  # Start prefill instances
  for i in $(seq 0 $((NUM_PREFILL_INSTANCES-1))); do
    # Calculate GPU ID - we'll distribute across available GPUs
    GPU_ID=$((i % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
    # Calculate port number (base port + instance number)
    PORT=$((8100 + i))
    # Calculate side channel port
    SIDE_CHANNEL_PORT=$((5559 + i))

    echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"

    # Build the command with or without model-specific args
    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
      --port $PORT \
      --enforce-eager \
      --disable-log-requests \
      --gpu-memory-utilization 0.2 \
      --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

    if [ -n "$model_args" ]; then
      FULL_CMD="$BASE_CMD $model_args"
    else
      FULL_CMD="$BASE_CMD"
    fi

    eval "$FULL_CMD &"

    # Store host and port for proxy configuration
    PREFILL_HOSTS+=("localhost")
    PREFILL_PORTS+=($PORT)
  done

  # Start decode instances
  for i in $(seq 0 $((NUM_DECODE_INSTANCES-1))); do
    # Calculate GPU ID - distribute across available GPUs, starting after the prefill GPUs
    GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
    # Calculate port number (base port + instance number)
    PORT=$((8200 + i))
    # Calculate side channel port
    SIDE_CHANNEL_PORT=$((5659 + i))

    echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"

    # Build the command with or without model-specific args
    BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
      --port $PORT \
      --enforce-eager \
      --disable-log-requests \
      --gpu-memory-utilization 0.2 \
      --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

    if [ -n "$model_args" ]; then
      FULL_CMD="$BASE_CMD $model_args"
    else
      FULL_CMD="$BASE_CMD"
    fi

    eval "$FULL_CMD &"

    # Store host and port for proxy configuration
    DECODE_HOSTS+=("localhost")
    DECODE_PORTS+=($PORT)
  done

  # Wait for all instances to start
  for PORT in "${PREFILL_PORTS[@]}"; do
    echo "Waiting for prefill instance on port $PORT to start..."
    wait_for_server $PORT
  done

  for PORT in "${DECODE_PORTS[@]}"; do
    echo "Waiting for decode instance on port $PORT to start..."
    wait_for_server $PORT
  done

  # Build the command for the proxy server with all the hosts and ports
  PROXY_CMD="python ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --port 8192"

  # Add all prefill hosts and ports
  PROXY_CMD+=" --prefiller-hosts ${PREFILL_HOSTS[@]}"
  PROXY_CMD+=" --prefiller-ports ${PREFILL_PORTS[@]}"

  # Add all decode hosts and ports
  PROXY_CMD+=" --decoder-hosts ${DECODE_HOSTS[@]}"
  PROXY_CMD+=" --decoder-ports ${DECODE_PORTS[@]}"

  # Start the proxy server
  echo "Starting proxy server with command: $PROXY_CMD"
  $PROXY_CMD &

  # Wait for the proxy to start
  sleep 5

  # Run lm eval for this model
  echo "Running tests for $model_name"
  TEST_MODEL=$model_name python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_accuracy.py

  # Clean up before running next model
  cleanup_instances
  sleep 3
}

# Run tests for each model
for model in "${MODELS[@]}"; do
  run_tests_for_model "$model"
done

echo "All tests completed!"
tests/v1/kv_connector/nixl_integration/test_accuracy.py (61 additions, 0 deletions)
# SPDX-License-Identifier: Apache-2.0
import os

import lm_eval
import openai

BASE_URL = "http://localhost:8192/v1"
NUM_CONCURRENT = 100
TASK = "gsm8k"
FILTER = "exact_match,strict-match"
RTOL = 0.03

# Model-specific expected values
EXPECTED_VALUES = {
    "Qwen/Qwen3-0.6B": 0.41,
    "deepseek-ai/deepseek-vl2-tiny": 0.20,
}

SIMPLE_PROMPT = "The best part about working on vLLM is that I got to meet so many people across various different organizations like UCB, Google, and Meta which means"  # noqa: E501

# Get model name from environment variable
MODEL_NAME = os.environ.get("TEST_MODEL", "Qwen/Qwen3-0.6B")


def run_simple_prompt():
    client = openai.OpenAI(api_key="EMPTY", base_url=BASE_URL)
    completion = client.completions.create(model=MODEL_NAME,
                                           prompt=SIMPLE_PROMPT)

    print("-" * 50)
    print(f"Completion results for {MODEL_NAME}:")
    print(completion)
    print("-" * 50)


def test_accuracy():
    """Run the end to end accuracy test."""
    run_simple_prompt()

    model_args = (f"model={MODEL_NAME},"
                  f"base_url={BASE_URL}/completions,"
                  f"num_concurrent={NUM_CONCURRENT},tokenized_requests=False")

    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=model_args,
        tasks=TASK,
    )

    measured_value = results["results"][TASK][FILTER]
    expected_value = EXPECTED_VALUES.get(MODEL_NAME)

    if expected_value is None:
        print(f"Warning: No expected value found for {MODEL_NAME}. "
              "Skipping accuracy check.")
        print(f"Measured value: {measured_value}")
        return

    assert (measured_value - RTOL < expected_value
            and measured_value + RTOL > expected_value
            ), f"Expected: {expected_value} | Measured: {measured_value}"