NVIDIA · Shixiaowei02 · Aug 13, 2025 · Aug 13, 2025
diff --git a/benchmarks/cpp/README.md b/benchmarks/cpp/README.md
@@ -336,15 +336,15 @@ cd cpp/build
 `disaggServerBenchmark` only supports `decoder-only` models.
 Here is the basic usage:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
 --generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
 ```
 This command will launch m context engines and n generation engines. You need to ensure `proc` is equal to the sum of the number of processes required for each engine plus 1. Since we use orchestrator mode for `disaggServerBenchmark` we need an additional process as the orchestrator. For example, if there are two context engines (one is TP2_PP1,another is TP1_PP1) and two generation engines(one is TP2_PP1,another is TP1_PP1), then the `proc` value should be set to 7.
 
 for example:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}
 
 # need 6 gpus and 7 processes to launch the benchmark.
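
Note on the `proc` rule in the hunk above: the count can be sketched as the sum of `TP x PP` over all engines plus one orchestrator process. This is a minimal sketch, assuming each engine needs exactly `TP x PP` ranks and one GPU per rank. The helper names `engine_ranks` and `required_procs` are hypothetical and not part of TensorRT-LLM.

```python
from typing import Iterable, Tuple

# (tensor-parallel size, pipeline-parallel size) of one engine
Engine = Tuple[int, int]


def engine_ranks(engines: Iterable[Engine]) -> int:
    """Total ranks for a group of engines, assuming TP x PP ranks per engine."""
    return sum(tp * pp for tp, pp in engines)


def required_procs(context: Iterable[Engine], generation: Iterable[Engine]) -> int:
    """Ranks for every engine plus one extra process for the orchestrator."""
    return engine_ranks(context) + engine_ranks(generation) + 1


# The README example: two context engines (TP2_PP1, TP1_PP1)
# and two generation engines (TP1_PP1, TP2_PP1).
context = [(2, 1), (1, 1)]
generation = [(1, 1), (2, 1)]

gpus = engine_ranks(context) + engine_ranks(generation)
procs = required_procs(context, generation)
print(f"gpus={gpus} procs={procs}")  # gpus=6 procs=7
```

The result matches the `mpirun -n 7` invocation and the "6 gpus and 7 processes" comment in the example.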

@@ -66,17 +66,6 @@ A. Yes, it's recommended that different executor use different GPUs . We support
 
 ### Debugging FAQs
 
-*Q. How to handle error `Disaggregated serving is not enabled, please check the configuration?`*
-
-A. please set `backendType` of `CacheTransceiverConfig`.
-```cpp
-ExecutorConfig executorConfig{...};
-
-executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
-```
-
-When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.
-
 *Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.

@@ -277,7 +277,7 @@ We also conducted performance evaluations of Qwen 3 on GB200 GPUs. The data indi
 
 ### Reproducing Steps
 
-We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/disaggregated/slurm).
+We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/disaggregated/slurm/benchmark).
 
 ## Future Work
 

@@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
 ```
 ./executorExampleDisaggregated -h
 ```
-Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
+Note that setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run the disaggregated executor.
 For example, you can run :
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 
 mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv
 

@@ -0,0 +1,6 @@
+# The overlap scheduler for context servers is currently disabled, as it is
+# not yet supported in disaggregated context server architectures.
+disable_overlap_scheduler: True
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
@@ -0,0 +1,12 @@
+# Please replace `ctx_hostname` and `gen_hostname` with the actual addresses.
+hostname: localhost
+port: 8000
+backend: pytorch
+context_servers:
+  num_instances: 1
+  urls:
+      - "ctx_hostname:8001"
+generation_servers:
+  num_instances: 1
+  urls:
+      - "gen_hostname:8002"
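
Note on the proxy config above: before submitting a job it can help to sanity-check that each server group lists as many URLs as `num_instances` and that every URL has the `host:port` form. The sketch below is a hypothetical pre-flight check, not TensorRT-LLM's own validation. It assumes the YAML has already been loaded into a plain dict, and the names `parse_url` and `check_group` are made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_url(url: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port); reject malformed entries."""
    host, _, port = url.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {url!r}")
    return host, int(port)


def check_group(name: str, group: Dict) -> List[Tuple[str, int]]:
    """Assumed rule: one URL per instance in each server group."""
    urls = group["urls"]
    if group["num_instances"] != len(urls):
        raise ValueError(
            f"{name}: num_instances={group['num_instances']} "
            f"but {len(urls)} url(s) listed"
        )
    return [parse_url(u) for u in urls]


# Dict mirroring the YAML in this hunk (placeholder hostnames kept as-is).
config = {
    "hostname": "localhost",
    "port": 8000,
    "backend": "pytorch",
    "context_servers": {"num_instances": 1, "urls": ["ctx_hostname:8001"]},
    "generation_servers": {"num_instances": 1, "urls": ["gen_hostname:8002"]},
}

ctx = check_group("context_servers", config["context_servers"])
gen = check_group("generation_servers", config["generation_servers"])
print(ctx, gen)
```

The ports parsed here (8001 and 8002) are the ones the launch script assigns to `ctx_port` and `gen_port`.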
@@ -0,0 +1,3 @@
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
@@ -0,0 +1,36 @@
+#!/bin/bash
+#SBATCH --partition=${partition}
+#SBATCH --account=${account}
+#SBATCH --job-name=${job_name}
+#SBATCH --time=02:00:00
+
+container_image=""
+mount_paths=""
+work_path=""
+ctx_port=8001
+gen_port=8002
+
+# The `container_image` must have the TensorRT-LLM wheel package pre-installed.
+# Once the task is successfully launched, an API service will be available externally at http://host_ip:PORT.
+# Launch a context server with `tp_size=8` using two 4-GPU nodes.
+srun --container-image=${container_image} \
+     --container-mounts=${mount_paths} \
+     -N 2 --ntasks-per-node=4 \
+     --mpi=pmix \
+     bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 8 --host 0.0.0.0 --port ${ctx_port} --extra_llm_api_options ${work_path}/ctx_extra-llm-api-config.yaml" &
+
+# Launch a generation server with `tp_size=4` using one 4-GPU node.
+srun --container-image=${container_image} \
+     --container-mounts=${mount_paths} \
+     -N 1 --ntasks-per-node=4 \
+     --mpi=pmix \
+     bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 4 --host 0.0.0.0 --port ${gen_port} --extra_llm_api_options ${work_path}/gen_extra-llm-api-config.yaml" &
+
+# Launch a proxy.
+# The `host_ip` mentioned above must be replaced with the IP address of the host machine that external
+# clients can reach, and that address must be filled in the `disagg_config.yaml` file.
+srun --container-image=${container_image} \
+     --container-mounts=${mount_paths} \
+     -N 1 --ntasks-per-node=1 \
+     --mpi=pmix \
+     bash -c "trtllm-llmapi-launch trtllm-serve disaggregated -c ${work_path}/disagg_config.yaml"
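
Note on the launch script above: each `srun` step should start as many tasks as the server has ranks. The sketch below checks that relationship, assuming one task per GPU rank and a world size of `tp_size x pp_size`. That assumption follows from the script's comments and is not a documented rule. The helper names are hypothetical.

```python
def ranks_launched(nodes: int, tasks_per_node: int) -> int:
    """Number of tasks an `srun -N <nodes> --ntasks-per-node=<n>` step starts."""
    return nodes * tasks_per_node


def layout_matches(tp_size: int, nodes: int, tasks_per_node: int, pp_size: int = 1) -> bool:
    """True when the srun layout provides exactly tp_size x pp_size ranks."""
    return tp_size * pp_size == ranks_launched(nodes, tasks_per_node)


# Context server: tp_size=8 on two 4-GPU nodes.
print(layout_matches(tp_size=8, nodes=2, tasks_per_node=4))  # True
# Generation server: tp_size=4 on one 4-GPU node.
print(layout_matches(tp_size=4, nodes=1, tasks_per_node=4))  # True
# A tp_size of 8 on a single 4-GPU node would not fit.
print(layout_matches(tp_size=8, nodes=1, tasks_per_node=4))  # False
```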
@@ -17,7 +17,7 @@ Please note that:
 
 ### Core Scripts
 
-Note that, core implementation of the slurm scripts are included in `examples/disaggregated/slurm`.
+Note that the core implementation of the SLURM scripts is included in `examples/disaggregated/slurm/benchmark`.
 
 1. `submit.sh` - Main entry point for submitting benchmark jobs
 2. `process_gen_iterlog.py` - Processes benchmark results and generates reports
@@ -35,8 +35,8 @@ Before running the scripts, ensure you have:
 ### Running Benchmarks
 
 ```bash
-# Refer to `examples/disaggregated/slurm/`
-# Please find the `disaggr_torch.slurm` script in the `examples/disaggregated/slurm/` directory.
+# Refer to `examples/disaggregated/slurm/benchmark/`
+# Please find the `disaggr_torch.slurm` script in the `examples/disaggregated/slurm/benchmark/` directory.
 # Make sure that SLURM parameters are correctly set in `disaggr_torch.slurm` before executing this script.
 ./submit.sh
 ```

@@ -1,6 +1,6 @@
 #!/bin/bash
 
-echo "Please find the \`disaggr_torch.slurm\` script in the \`examples/disaggregated/slurm/\` directory."
+echo "Please find the \`disaggr_torch.slurm\` script in the \`examples/disaggregated/slurm/benchmark/\` directory."
 
 partition=<partition>
 account=<account>