[BUG]: Performance degradation in 0.7.0 #4774

@Anthony24601

Describe the Bug

I'm seeing a performance degradation between Dynamo 0.6.1 and 0.7.0.

I am using LMBenchmark's synthetic multi-round QA test with 30 users, and I only test at 2 QPS.
I configured a /raid device with the following fio results:

Finally, I set it up for disk offload. With 0.6.1 I was getting a TTFT of 1.27 seconds. With 0.7.0 I'm seeing a TTFT of 8.23 seconds.

Steps to Reproduce

Pull from main.
Start the Dynamo container using the ./container/run.sh script.

Once inside, start vllm with dynamo:
DYN_KVBM_CPU_CACHE_GB=50 \
DYN_KVBM_DISK_CACHE_GB=500 \
DYN_KVBM_DISK_CACHE_DIR=/raid \
DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=1 \
DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=120 \
vllm serve --port 8000 \
  --block-size 128 \
  --gpu-memory-utilization 0.7 \
  --disable-log-requests \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' \
  Qwen/Qwen3-8B

I'm on an H100 machine, so the above command gives the following stats:

GPU Utilization: 0.7 (70%)
G1 (GPU): 38 GiB
G2 (CPU): 50 GB
G3 (Disk): 500 GB
Users: 30 => 96 GB
Rounds: 20
Model: Qwen3-8B

Clone LMBenchmark's tests: https://github.com/LMCache/LMBenchmark
In LMBenchmark/synthetic-multi-round-qa, run:
long_input_short_output_run.sh Qwen/Qwen3-8B http://localhost:8000 /tmp 2

Note: you must modify long_input_short_output_run.sh to use 30 users instead of 15 users; otherwise you will never offload to disk with the settings above.
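
The arithmetic behind that note can be sketched as follows. This is a rough estimate, not a measurement: it assumes the 96 GB figure for 30 users scales linearly with user count, and that the G1 and G2 sizes listed above are fully available for KV blocks. The function name `spills_to_disk` is hypothetical and only used for illustration.

```python
GIB = 2**30   # bytes in one GiB (G1 is reported in GiB)
GB = 10**9    # bytes in one GB (G2, G3 and the user footprint are reported in GB)

def spills_to_disk(users, per_user_gb, gpu_gib=38, cpu_gb=50):
    """Return True if the total KV footprint exceeds the GPU + CPU tiers."""
    working_set = users * per_user_gb * GB
    fast_tiers = gpu_gib * GIB + cpu_gb * GB
    return working_set > fast_tiers

# Assumption: the 96 GB reported for 30 users splits evenly per user.
per_user_gb = 96 / 30

for users in (15, 30):
    print(users, spills_to_disk(users, per_user_gb))
```

Under these assumptions, 15 users need roughly 48 GB, which fits inside G1 + G2 (about 90.8 GB), while 30 users need about 96 GB and overflow into G3.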

Expected Behavior

TTFT under 1.5 seconds

Actual Behavior

12.7 seconds

Environment

H100
Ubuntu 22.04.4 LTS

NV_LIBCUBLAS_VERSION=12.8.4.1-1
NVIDIA_VISIBLE_DEVICES=all
NV_NVTX_VERSION=12.8.90-1
NV_LIBCUSPARSE_VERSION=12.5.8.93-1
NV_LIBNPP_VERSION=12.3.3.100-1
NCCL_VERSION=2.25.1-1
PWD=/workspace
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu
NIXL_PREFIX=/opt/nvidia/nvda_nixl
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_LIBNPP_PACKAGE=libnpp-12-8=12.3.3.100-1
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=12.8.90-1
HOME=/home/dynamo
VIRTUAL_ENV=/opt/dynamo/venv
CUDA_VERSION=12.8.1
NV_LIBCUBLAS_PACKAGE=libcublas-12-8=12.8.4.1-1
DYNAMO_COMMIT_SHA=f49d6873e417ef82090ed492ef00b6939bd5a8d0
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu/plugins
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-8
TERM=xterm
SHLVL=1
NV_CUDA_LIB_VERSION=12.8.1-1
NVARCH=x86_64
VIRTUAL_ENV_PROMPT=venv
NV_LIBNCCL_PACKAGE=libnccl2=2.25.1-1+cuda12.8
LD_LIBRARY_PATH=/opt/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_install/lib:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu:/opt/nvidia/nvda_nixl/lib/x86_64-linux-gnu/plugins:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:/usr/local/cuda/lib64
PS1=[\e]0;\u@\h: \w\a]${debian_chroot:+($debian_chroot)}\u@\h:\w$
PATH=/opt/dynamo/venv/bin:/usr/local/ucx/bin:/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:/opt/dynamo/venv/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
DYNAMO_HOME=/opt/dynamo
NV_LIBNCCL_PACKAGE_VERSION=2.25.1-1
CPATH=/usr/local/cuda/include
