Describe the Bug
I'm seeing a performance degradation between dynamo 0.6.1 and 0.7.0
I am using LMBenchmark with 30 users on its synthetic multi-round QA test, and I only test at 2 QPS.
I configured a /raid device and benchmarked it with fio.
Finally, I set it up for disk offload. In 0.6.1 I was getting a TTFT of 1.27 seconds. Now I'm seeing a TTFT of 8.23 seconds.
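For anyone reproducing the disk setup, an fio run along the following lines can be used to characterize the /raid device (the job parameters here are illustrative assumptions, not the exact job I ran):

# Illustrative fio job: sequential 1 MiB writes with direct I/O, 4 jobs, queue depth 32
fio --name=seqwrite --directory=/raid --rw=write --bs=1M --size=4G \
    --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting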
Steps to Reproduce
Pull from main
Start the Dynamo container using the ./container/run.sh script.
Once inside, start vLLM with Dynamo:
DYN_KVBM_CPU_CACHE_GB=50 \
DYN_KVBM_DISK_CACHE_GB=500 \
DYN_KVBM_DISK_CACHE_DIR=/raid \
DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=1 \
DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=120 \
vllm serve --port 8000 \
  --block-size 128 \
  --gpu-memory-utilization 0.7 \
  --disable-log-requests \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' \
  Qwen/Qwen3-8B
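To confirm that blocks are actually being offloaded to disk while the benchmark runs, the cache directory can be watched from another shell. This is plain shell, nothing Dynamo-specific, and assumes KVBM writes its disk cache under /raid as DYN_KVBM_DISK_CACHE_DIR suggests:

# Poll the disk cache directory every 5 seconds to see offloaded data accumulate
watch -n 5 'du -sh /raid && ls /raid | wc -l'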
I'm on an H100 machine, so the above command gives the following stats:
GPU Utilization: 0.7 (70%)
G1 (GPU): 38 GiB
G2 (CPU): 50 GB
G3 (Disk): 500 GB
Users: 30 => 96 GB
Rounds: 20
Model: Qwen3-8B
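For context on how these numbers interact (rough arithmetic under a few assumptions: an 80 GB H100, roughly 16 GB of BF16 weights for Qwen3-8B, and about 144 KiB of BF16 KV cache per token assuming Qwen3-8B's 36 layers, 8 KV heads, and head dim 128): 80 GB × 0.7 ≈ 56 GB is reserved by vLLM, and after ~16 GB of weights plus workspace overhead, roughly the 38 GiB reported above remains for the GPU KV cache (G1). G2 and G3 come directly from DYN_KVBM_CPU_CACHE_GB and DYN_KVBM_DISK_CACHE_GB. Thirty concurrent users with long multi-round contexts at ~144 KiB/token account for the ~96 GB figure, which exceeds G1 + G2 (≈ 90 GB), so the workload has to spill into the 500 GB disk tier, which is exactly what this test is meant to exercise.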
Clone LMBenchmark's tests: https://github.com/LMCache/LMBenchmark
In LMBenchmark/synthetic-multi-round-qa, run:
long_input_short_output_run.sh Qwen/Qwen3-8B http://localhost:8000 /tmp 2
Note: you must modify long_input_short_output_run.sh to use 30 users instead of 15; otherwise you will never offload to disk with the settings above.
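To find where the user count is configured (the exact variable name depends on the LMBenchmark version, so this is just a way to locate the setting rather than a specific edit):

# Locate the line(s) in the script that define the number of simulated users
grep -n -i "user" long_input_short_output_run.sh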
Expected Behavior
TTFT under 1.5 seconds
Actual Behavior
TTFT of 12.7 seconds
Environment
H100
Ubuntu 22.04.4 LTS
NV_LIBCUBLAS_VERSION=12.8.4.1-1
NVIDIA_VISIBLE_DEVICES=all
NV_NVTX_VERSION=12.8.90-1
NV_LIBCUSPARSE_VERSION=12.5.8.93-1
NV_LIBNPP_VERSION=12.3.3.100-1
NCCL_VERSION=2.25.1-1
PWD=/workspace
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu
NIXL_PREFIX=/opt/nvidia/nvda_nixl
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_LIBNPP_PACKAGE=libnpp-12-8=12.3.3.100-1
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=12.8.90-1
HOME=/home/dynamo
VIRTUAL_ENV=/opt/dynamo/venv
CUDA_VERSION=12.8.1
NV_LIBCUBLAS_PACKAGE=libcublas-12-8=12.8.4.1-1
DYNAMO_COMMIT_SHA=f49d6873e417ef82090ed492ef00b6939bd5a8d0
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu/plugins
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-8
TERM=xterm
SHLVL=1
NV_CUDA_LIB_VERSION=12.8.1-1
NVARCH=x86_64
VIRTUAL_ENV_PROMPT=venv
NV_LIBNCCL_PACKAGE=libnccl2=2.25.1-1+cuda12.8
LD_LIBRARY_PATH=/opt/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu/plugins:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/usr/local/cuda/lib64
PS1=[\e]0;\u@\h: \w\a]${debian_chroot:+(
PATH=/opt/dynamo/venv/bin:/usr/local/ucx/bin:/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:/opt/dynamo/venv/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
DYNAMO_HOME=/opt/dynamo
NV_LIBNCCL_PACKAGE_VERSION=2.25.1-1
CPATH=/usr/local/cuda/include
Additional Context
No response
Screenshots
No response