
[P/D] NIXL Integration #17751


Merged — 100 commits, May 12, 2025 (showing changes from 52 commits).

Commits
f1575de
[P/D Disagg] Direct NIXL Connector (#60)
tlrmchlsmth May 3, 2025
c90a2c8
fix failing tests (#64)
robertgshaw2-redhat May 3, 2025
91040e3
NIXL Runtime Handshake (#63)
robertgshaw2-redhat May 3, 2025
b1d83f9
[V1] Support multiple kv connectors (#61)
njhill May 3, 2025
a904820
[Disagg PD] Add test for xPyD (#66)
tlrmchlsmth May 3, 2025
8061a5c
[Test] Improve MultiConnector test (#69)
njhill May 4, 2025
06847be
[P/D Disagg] [1/N] Support Homogeneous TP > 1 (#65)
robertgshaw2-redhat May 4, 2025
8beac5e
[PD Disagg] Cruft / Minor Mem Leak (#71)
robertgshaw2-redhat May 5, 2025
2c50275
[P/D Disagg] NIXL MLA (#70)
tlrmchlsmth May 5, 2025
8ce4c07
[Bugfix] Fix env name for VLLM_NIXL_SIDE_CHANNEL_HOST (#73)
robertgshaw2-redhat May 5, 2025
bf0be1b
Merge branch 'main' into disagg_pd_dev
tlrmchlsmth May 6, 2025
3783696
Merge pull request #76 from neuralmagic/disagg_pd_dev_merge_main
tlrmchlsmth May 6, 2025
42b869e
updated
robertgshaw2-redhat May 6, 2025
bb2abeb
fixup testing
robertgshaw2-redhat May 6, 2025
b2de5e9
remove multi-connector
robertgshaw2-redhat May 6, 2025
391a94a
remove multi-connector
robertgshaw2-redhat May 6, 2025
42d6d26
remove multi-connector
robertgshaw2-redhat May 6, 2025
527fbf1
cleanup paths
robertgshaw2-redhat May 6, 2025
40fd5b0
cleanup paths
robertgshaw2-redhat May 6, 2025
79561f4
cleanup paths
robertgshaw2-redhat May 6, 2025
ccd356a
cleanup paths
robertgshaw2-redhat May 6, 2025
43adf8e
cleanup paths
robertgshaw2-redhat May 7, 2025
e98b512
cleanup paths
robertgshaw2-redhat May 7, 2025
84c1379
cleanup paths
robertgshaw2-redhat May 7, 2025
e3e1738
cleanup
robertgshaw2-redhat May 7, 2025
cd2484d
cleanup
robertgshaw2-redhat May 7, 2025
925de01
cleanup
robertgshaw2-redhat May 7, 2025
61de85e
cleanup
robertgshaw2-redhat May 7, 2025
0c80607
cleanup
robertgshaw2-redhat May 7, 2025
e1e03cb
cleanup
robertgshaw2-redhat May 7, 2025
49faaf1
cleanup
robertgshaw2-redhat May 7, 2025
b3ef87c
cleanup
robertgshaw2-redhat May 7, 2025
3625bc2
cleanup
robertgshaw2-redhat May 7, 2025
67b4e50
cleanup
robertgshaw2-redhat May 7, 2025
64bbff1
cleanup
robertgshaw2-redhat May 7, 2025
ef349ef
updated
robertgshaw2-redhat May 7, 2025
80a8c79
updated
robertgshaw2-redhat May 7, 2025
7d5444c
updated
robertgshaw2-redhat May 7, 2025
ff0572e
close but no cigar
robertgshaw2-redhat May 7, 2025
9e7ee72
correctness!
robertgshaw2-redhat May 7, 2025
5c3fc88
correctness!
robertgshaw2-redhat May 7, 2025
5fd2138
Generalize/streamline async loading (remote prefill) side
njhill May 7, 2025
bb58f7c
Merge pull request #9 from njhill/abstract-async-load
robertgshaw2-redhat May 8, 2025
e673bdd
Move new GPUModelRunner methods out of execute_model method
njhill May 8, 2025
be6879d
Merge remote-tracking branch 'upstream/main' into upstream-nixl-clean
robertgshaw2-redhat May 8, 2025
48add41
updated
robertgshaw2-redhat May 8, 2025
3050565
updated
robertgshaw2-redhat May 8, 2025
c2f2e77
Merge pull request #10 from njhill/streamline-runner
robertgshaw2-redhat May 8, 2025
2d44ff7
updated
robertgshaw2-redhat May 8, 2025
6ee57bd
updated
robertgshaw2-redhat May 8, 2025
dd0c0ca
updated
robertgshaw2-redhat May 8, 2025
a57ec2d
support prefill instance disabling prefix caching
robertgshaw2-redhat May 8, 2025
0d7f3c8
updated
robertgshaw2-redhat May 8, 2025
816c3fa
poc with local prefix caching seems to be working
robertgshaw2-redhat May 8, 2025
4add1a6
remove debug cruft
robertgshaw2-redhat May 8, 2025
1616b15
updated
robertgshaw2-redhat May 8, 2025
71a09f3
remove unnecessary send_kv_no_op list
njhill May 8, 2025
e4e650d
added more tests
robertgshaw2-redhat May 8, 2025
8433c0e
added more tests
robertgshaw2-redhat May 8, 2025
14e46ba
update comment
njhill May 8, 2025
a9ec035
Merge branch 'upstream-nixl-clean' of https://github.com/robertgshaw2…
robertgshaw2-redhat May 8, 2025
5bcd191
updated
robertgshaw2-redhat May 8, 2025
9812ecd
Abstract async saving
njhill May 8, 2025
70f3ed5
Merge pull request #11 from njhill/abstract-async-save
robertgshaw2-redhat May 9, 2025
1ce8fd9
fix nits
robertgshaw2-redhat May 9, 2025
c5dc489
remove spurious change
robertgshaw2-redhat May 9, 2025
5a4150e
Arrange methods in base.py in correct worker/scheduler groupings
njhill May 9, 2025
5389ef5
cleanup tests
robertgshaw2-redhat May 10, 2025
e8d961d
cleanup tests
robertgshaw2-redhat May 10, 2025
7514ce0
update comment
robertgshaw2-redhat May 10, 2025
4abf0c5
cleanup
robertgshaw2-redhat May 10, 2025
4a50103
cleanup
robertgshaw2-redhat May 10, 2025
51ed361
cleanup
robertgshaw2-redhat May 10, 2025
9b4433b
cleanup
robertgshaw2-redhat May 10, 2025
ccbc114
Merge branch 'main' into upstream-nixl-clean
robertgshaw2-redhat May 10, 2025
28e9a7c
updated
robertgshaw2-redhat May 10, 2025
63f17f6
updated
robertgshaw2-redhat May 10, 2025
f60e650
updated
robertgshaw2-redhat May 10, 2025
f3fdfe8
reduce spurious changes
robertgshaw2-redhat May 10, 2025
287b334
updated
robertgshaw2-redhat May 10, 2025
d35b116
updated
robertgshaw2-redhat May 10, 2025
5ed8c2f
updated
robertgshaw2-redhat May 10, 2025
6baecd7
updated
robertgshaw2-redhat May 10, 2025
f784c19
fix mock issue
robertgshaw2-redhat May 10, 2025
7a27ffe
pass generate-finished request ids to worker connector
njhill May 10, 2025
f0ae12f
stahs
robertgshaw2-redhat May 10, 2025
ade6c29
fix test
robertgshaw2-redhat May 10, 2025
bd02803
Merge branch 'upstream-nixl-clean' of https://github.com/robertgshaw2…
robertgshaw2-redhat May 10, 2025
73cea2d
fix typing
robertgshaw2-redhat May 10, 2025
4d66549
Merge branch 'main' into upstream-nixl-clean
robertgshaw2-redhat May 11, 2025
d6fdf57
Add kv_transfer_params to chat completion endpoints
nerdalert May 11, 2025
70c07de
Merge pull request #12 from nerdalert/add-chat-kvparams
robertgshaw2-redhat May 12, 2025
e7066cf
fix edge case
robertgshaw2-redhat May 12, 2025
241facb
fix edge case
robertgshaw2-redhat May 12, 2025
f665786
mypy
robertgshaw2-redhat May 12, 2025
626bb11
added edge case tests
robertgshaw2-redhat May 12, 2025
4f0a5e9
added edge case tests
robertgshaw2-redhat May 12, 2025
31bf313
added edge case tests
robertgshaw2-redhat May 12, 2025
56937cd
added edge case tests
robertgshaw2-redhat May 12, 2025
42b1ad0
updated
robertgshaw2-redhat May 12, 2025
1 change: 1 addition & 0 deletions .buildkite/test-pipeline.yaml
@@ -209,6 +209,7 @@ steps:
- pytest -v -s v1/worker
- pytest -v -s v1/structured_output
- pytest -v -s v1/spec_decode
- pytest -v -s v1/kv_connector/unit
- pytest -v -s v1/test_serial_utils.py
- pytest -v -s v1/test_stats.py
- pytest -v -s v1/test_utils.py
171 changes: 171 additions & 0 deletions tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
@@ -0,0 +1,171 @@
#!/bin/bash
set -xe

# Models to run
MODELS=(
"Qwen/Qwen3-0.6B"
)

# Number of prefill and decode instances to create
NUM_PREFILL_INSTANCES=${NUM_PREFILL_INSTANCES:-1} # Default to 1
NUM_DECODE_INSTANCES=${NUM_DECODE_INSTANCES:-2} # Default to 2

# Find the git repository root directory
GIT_ROOT=$(git rev-parse --show-toplevel)

# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

# Waits for vLLM to start.
wait_for_server() {
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}

# Function to clean up previous instances
cleanup_instances() {
echo "Cleaning up any running vLLM instances..."
pkill -f "vllm serve" || true
sleep 2
}

# Function to get model-specific extra arguments (e.g. for deepseek models)
get_model_args() {
local model_name=$1
local extra_args=""

if [[ "$model_name" == "deepseek-ai/deepseek-vl2-tiny" ]]; then
extra_args="--hf_overrides '{\"architectures\": [\"DeepseekVLV2ForCausalLM\"]}' --trust-remote-code"
fi

echo "$extra_args"
}


# Function to run tests for a specific model
run_tests_for_model() {
local model_name=$1
echo "================================"
echo "Testing model: $model_name"
echo "================================"

# Get model-specific arguments
local model_args=$(get_model_args "$model_name")

# Arrays to store all hosts and ports
PREFILL_HOSTS=()
PREFILL_PORTS=()
DECODE_HOSTS=()
DECODE_PORTS=()

# Start prefill instances
for i in $(seq 0 $((NUM_PREFILL_INSTANCES-1))); do
# Calculate GPU ID - we'll distribute across available GPUs
GPU_ID=$((i % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
# Calculate port number (base port + instance number)
PORT=$((8100 + i))
# Calculate side channel port
SIDE_CHANNEL_PORT=$((5559 + i))

echo "Starting prefill instance $i on GPU $GPU_ID, port $PORT"

# Build the command with or without model-specific args
BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
--port $PORT \
--enforce-eager \
--disable-log-requests \
--gpu-memory-utilization 0.2 \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
else
FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"

# Store host and port for proxy configuration
PREFILL_HOSTS+=("localhost")
PREFILL_PORTS+=($PORT)
done

# Start decode instances
for i in $(seq 0 $((NUM_DECODE_INSTANCES-1))); do
# Calculate GPU ID - we'll distribute across available GPUs, starting from after prefill GPUs
GPU_ID=$(((i + NUM_PREFILL_INSTANCES) % $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)))
# Calculate port number (base port + instance number)
PORT=$((8200 + i))
# Calculate side channel port
SIDE_CHANNEL_PORT=$((5659 + i))

echo "Starting decode instance $i on GPU $GPU_ID, port $PORT"

# Build the command with or without model-specific args
BASE_CMD="CUDA_VISIBLE_DEVICES=$GPU_ID VLLM_NIXL_SIDE_CHANNEL_PORT=$SIDE_CHANNEL_PORT vllm serve $model_name \
--port $PORT \
--enforce-eager \
--disable-log-requests \
--gpu-memory-utilization 0.2 \
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"

if [ -n "$model_args" ]; then
FULL_CMD="$BASE_CMD $model_args"
else
FULL_CMD="$BASE_CMD"
fi

eval "$FULL_CMD &"

# Store host and port for proxy configuration
DECODE_HOSTS+=("localhost")
DECODE_PORTS+=($PORT)
done

# Wait for all instances to start
for PORT in "${PREFILL_PORTS[@]}"; do
echo "Waiting for prefill instance on port $PORT to start..."
wait_for_server $PORT
done

for PORT in "${DECODE_PORTS[@]}"; do
echo "Waiting for decode instance on port $PORT to start..."
wait_for_server $PORT
done

# Build the command for the proxy server with all the hosts and ports
PROXY_CMD="python ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --port 8192"

# Add all prefill hosts and ports
PROXY_CMD+=" --prefiller-hosts ${PREFILL_HOSTS[@]}"
PROXY_CMD+=" --prefiller-ports ${PREFILL_PORTS[@]}"

# Add all decode hosts and ports
PROXY_CMD+=" --decoder-hosts ${DECODE_HOSTS[@]}"
PROXY_CMD+=" --decoder-ports ${DECODE_PORTS[@]}"

# Start the proxy server
echo "Starting proxy server with command: $PROXY_CMD"
$PROXY_CMD &

# Wait for the proxy to start
sleep 5

# Run lm eval for this model
echo "Running tests for $model_name"
TEST_MODEL=$model_name python -m pytest -s -x ${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_accuracy.py

# Clean up before running next model
cleanup_instances
sleep 3
}

# Run tests for each model
for model in "${MODELS[@]}"; do
run_tests_for_model "$model"
done

echo "All tests completed!"
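The script above assigns each instance a GPU round-robin (via the `nvidia-smi` device count), a serving port offset from a per-role base (8100 for prefill, 8200 for decode), and a NIXL side-channel port offset from a per-role base (5559 / 5659). A minimal Python sketch of that placement logic — `NUM_GPUS = 4` is an assumption standing in for the `nvidia-smi` count:

```python
# Sketch of the script's instance placement: GPUs assigned round-robin,
# ports assigned sequentially from per-role base ports.
NUM_GPUS = 4  # assumption; the script queries nvidia-smi for this
NUM_PREFILL_INSTANCES = 1
NUM_DECODE_INSTANCES = 2


def placement(role_offset, base_port, base_side_channel, count):
    """Return one {gpu, port, side_channel} record per instance."""
    return [{
        "gpu": (i + role_offset) % NUM_GPUS,
        "port": base_port + i,
        "side_channel": base_side_channel + i,
    } for i in range(count)]


prefill = placement(0, 8100, 5559, NUM_PREFILL_INSTANCES)
# Decode instances start on the GPU after the last prefill GPU.
decode = placement(NUM_PREFILL_INSTANCES, 8200, 5659, NUM_DECODE_INSTANCES)
```

With the defaults above, the prefill instance lands on GPU 0 at port 8100, and the two decode instances on GPUs 1 and 2 at ports 8200 and 8201.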
60 changes: 60 additions & 0 deletions tests/v1/kv_connector/nixl_integration/test_accuracy.py
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: Apache-2.0
import os

import lm_eval
import openai

BASE_URL = "http://localhost:8192/v1"
NUM_CONCURRENT = 100
TASK = "gsm8k"
FILTER = "exact_match,strict-match"
RTOL = 0.03

# Model-specific expected values
EXPECTED_VALUES = {
"Qwen/Qwen3-0.6B": 0.41,
}

SIMPLE_PROMPT = "The best part about working on vLLM is that I got to meet so many people across various different organizations like UCB, Google, and Meta which means"  # noqa: E501

# Get model name from environment variable
MODEL_NAME = os.environ.get("TEST_MODEL", "Qwen/Qwen3-0.6B")


def run_simple_prompt():
client = openai.OpenAI(api_key="EMPTY", base_url=BASE_URL)
completion = client.completions.create(model=MODEL_NAME,
prompt=SIMPLE_PROMPT)

print("-" * 50)
print(f"Completion results for {MODEL_NAME}:")
print(completion)
print("-" * 50)


def test_accuracy():
"""Run the end to end accuracy test."""
run_simple_prompt()

model_args = (f"model={MODEL_NAME},"
f"base_url={BASE_URL}/completions,"
f"num_concurrent={NUM_CONCURRENT},tokenized_requests=False")

results = lm_eval.simple_evaluate(
model="local-completions",
model_args=model_args,
tasks=TASK,
)

measured_value = results["results"][TASK][FILTER]
expected_value = EXPECTED_VALUES.get(MODEL_NAME)

if expected_value is None:
print(f"Warning: No expected value found for {MODEL_NAME}. "
"Skipping accuracy check.")
print(f"Measured value: {measured_value}")
return

    assert abs(measured_value - expected_value) < RTOL, (
        f"Expected: {expected_value} | Measured: {measured_value}")
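The final assertion accepts any measured score within an absolute band of ±0.03 around the expected value (the constant is named `RTOL`, but as used it is an absolute tolerance). A minimal sketch of the equivalent check, with `within_tolerance` as an illustrative helper name:

```python
RTOL = 0.03  # absolute tolerance band, matching the test's constant


def within_tolerance(measured, expected, tol=RTOL):
    # Equivalent to: measured - tol < expected < measured + tol
    return abs(measured - expected) < tol


print(within_tolerance(0.42, 0.41))  # inside the band
print(within_tolerance(0.50, 0.41))  # outside the band
```

So a Qwen3-0.6B gsm8k score anywhere in (0.38, 0.44) passes against the expected 0.41.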