@@ -135,7 +135,6 @@ YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
pytorch_backend_config:
enable_overlap_scheduler: true
use_cuda_graph: true
moe_backend: TRTLLM
speculative_config:
@@ -218,7 +217,6 @@ pytorch_backend_config:
- 256
- 384
print_iter_log: true
enable_overlap_scheduler: true
enable_attention_dp: true
EOF

@@ -260,7 +258,6 @@ YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
pytorch_backend_config:
enable_overlap_scheduler: true
use_cuda_graph: true
speculative_config:
decoding_type: MTP
@@ -314,7 +311,6 @@ pytorch_backend_config:
use_cuda_graph: true
cuda_graph_batch_sizes:
- 128
enable_overlap_scheduler: true
enable_attention_dp: true
EOF

4 changes: 2 additions & 2 deletions examples/disaggregated/README.md
@@ -9,7 +9,7 @@ You can use multiple `trtllm-serve` commands to launch the context and generation servers
for disaggregated serving. For example, you could launch two context servers and one generation server as follows:

```
echo -e "pytorch_backend_config:\n enable_overlap_scheduler: False\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
echo -e "pytorch_backend_config:\n disable_overlap_scheduler: True\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n max_num_tokens: 2048" > gen_extra-llm-api-config.yml

export TRTLLM_USE_UCX_KVCACHE=1
@@ -65,7 +65,7 @@ model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
backend: "pytorch"
pytorch_backend_config:
use_cuda_graph: False
enable_overlap_scheduler: False
disable_overlap_scheduler: True
context_servers:
num_instances: 1
tensor_parallel_size: 1
2 changes: 1 addition & 1 deletion examples/disaggregated/disagg_config.yaml
@@ -5,7 +5,7 @@ free_gpu_memory_fraction: 0.25
backend: "pytorch"
pytorch_backend_config:
use_cuda_graph: False
enable_overlap_scheduler: False
disable_overlap_scheduler: True
context_servers:
num_instances: 1
tensor_parallel_size: 1
3 changes: 1 addition & 2 deletions examples/llm-api/llm_inference_kv_events.py
@@ -6,8 +6,7 @@


def main():
pytorch_config = PyTorchConfig(enable_overlap_scheduler=True,
autotuner_enabled=False,
pytorch_config = PyTorchConfig(autotuner_enabled=False,
kv_cache_dtype='auto')

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
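With the flipped default, the example above no longer opts in to the overlap scheduler at all. Below is a minimal sketch of the updated flow; the import paths and the `pytorch_backend_config` keyword on `LLM` are assumptions based on the surrounding examples, not something this hunk shows.

```python
# Minimal sketch, assuming the import paths and the LLM keyword below;
# only autotuner_enabled/kv_cache_dtype come from the example itself.
from tensorrt_llm import LLM
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig

# Overlap scheduling is now on by default, so it no longer appears here.
pytorch_config = PyTorchConfig(autotuner_enabled=False, kv_cache_dtype='auto')

# To reproduce the old non-overlapped behavior, opt out explicitly:
# pytorch_config = PyTorchConfig(autotuner_enabled=False, kv_cache_dtype='auto',
#                                disable_overlap_scheduler=True)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          pytorch_backend_config=pytorch_config)
```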
1 change: 0 additions & 1 deletion examples/llm-api/llm_mgmn_trtllm_bench.sh
@@ -76,7 +76,6 @@ srun -l \
cat > /tmp/pytorch_extra_args.txt << EOF
pytorch_backend_config:
use_cuda_graph: false
enable_overlap_scheduler: true
cuda_graph_padding_enabled: false
print_iter_log: true
enable_attention_dp: false
9 changes: 4 additions & 5 deletions examples/models/core/deepseek_v3/README.md
@@ -21,7 +21,10 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [Quick Start](#quick-start)
- [Run a single inference](#run-a-single-inference)
- [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
- [Relaxed acceptance](#relaxed-acceptance)
- [Long context support](#long-context-support)
- [ISL-64k-OSL-1024](#isl-64k-osl-1024)
- [ISL-128k-OSL-1024](#isl-128k-osl-1024)
- [Evaluation](#evaluation)
- [Serving](#serving)
- [Advanced Usages](#advanced-usages)
@@ -34,6 +37,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [FP8 KV Cache and MLA](#fp8-kv-cache-and-mla)
- [W4AFP8](#w4afp8)
- [Notes and Troubleshooting](#notes-and-troubleshooting)
- [Known Issues](#known-issues)


## Hardware Requirements
@@ -134,7 +138,6 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \

cat <<EOF > /tmp/extra-llm-api-config.yml
pytorch_backend_config:
enable_overlap_scheduler: true
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes: [1, 4, 8, 12]
@@ -163,7 +166,6 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \

cat <<EOF > /tmp/extra-llm-api-config.yml
pytorch_backend_config:
enable_overlap_scheduler: true
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes: [1, 2]
@@ -190,7 +192,6 @@ Evaluate the model accuracy using `trtllm-eval`.
cat >./extra-llm-api-config.yml <<EOF
pytorch_backend_config:
use_cuda_graph: true
enable_overlap_scheduler: true
enable_attention_dp: true
EOF
```
@@ -246,7 +247,6 @@ pytorch_backend_config:
- 256
- 384
print_iter_log: true
enable_overlap_scheduler: true
enable_attention_dp: true
EOF

@@ -417,7 +417,6 @@ pytorch_backend_config:
- 256
- 384
print_iter_log: true
enable_overlap_scheduler: true
enable_attention_dp: true
EOF
```
3 changes: 1 addition & 2 deletions examples/models/core/qwen/README.md
@@ -22,7 +22,7 @@ This document shows how to build and run a [Qwen](https://huggingface.co/Qwen) m
- [Run a single inference](#run-a-single-inference)
- [Evaluation](#evaluation)
- [Serving](#serving)
- [Notes and Troubleshooting](#notes-and-troubleshooting)
- [Notes and Troubleshooting](#notes-and-troubleshooting)
- [Credits](#credits)

## Overview
@@ -668,7 +668,6 @@ pytorch_backend_config:
- 256
- 384
print_iter_log: true
enable_overlap_scheduler: true
enable_attention_dp: true
EOF

4 changes: 2 additions & 2 deletions examples/pytorch/quickstart_advanced.py
@@ -72,7 +72,7 @@ def add_llm_args(parser):
parser.add_argument("--kv_cache_fraction", type=float, default=None)

# Runtime
parser.add_argument('--enable_overlap_scheduler',
parser.add_argument('--disable_overlap_scheduler',
default=False,
action='store_true')
parser.add_argument('--enable_chunked_prefill',
@@ -124,7 +124,7 @@ def parse_arguments():

def setup_llm(args):
pytorch_config = PyTorchConfig(
enable_overlap_scheduler=args.enable_overlap_scheduler,
disable_overlap_scheduler=args.disable_overlap_scheduler,
kv_cache_dtype=args.kv_cache_dtype,
attn_backend=args.attention_backend,
use_cuda_graph=args.use_cuda_graph,
2 changes: 1 addition & 1 deletion examples/scaffolding/run_best_of_n_with_reward.py
@@ -39,7 +39,7 @@ def main():
max_batch_size=args.sample_num,
max_num_tokens=8192,
kv_cache_free_gpu_memory_fraction=0.2,
enable_overlap_scheduler=False)
disable_overlap_scheduler=True)
workers[NativeGenerationController.WorkerTag.GENERATION] = gen_worker
workers[QwenRewardController.WorkerTag.REWARD] = reward_worker

2 changes: 1 addition & 1 deletion tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
@@ -302,7 +302,7 @@ def create_autodeploy_executor(
model_engine=engine,
decoder=decoder,
dist=mpi_dist,
enable_overlap_scheduler=py_config.enable_overlap_scheduler,
disable_overlap_scheduler=py_config.disable_overlap_scheduler,
max_input_len=executor_config.max_input_len,
max_batch_size=executor_config.max_batch_size,
max_draft_tokens=executor_config.speculative_config.max_draft_tokens
14 changes: 7 additions & 7 deletions tensorrt_llm/_torch/pyexecutor/_util.py
@@ -343,7 +343,7 @@ def create_py_executor_instance(dist,
if spec_config is not None:
raise ValueError(
"Guided decoding is not supported with speculative decoding.")
if pytorch_backend_config.enable_overlap_scheduler:
if not pytorch_backend_config.disable_overlap_scheduler:
raise ValueError(
"Guided decoding is not supported with overlap scheduler.")

@@ -415,7 +415,7 @@ def create_py_executor_instance(dist,
if mapping.has_pp():
num_micro_batches = mapping.pp_size
else:
num_micro_batches = 2 if pytorch_backend_config.enable_overlap_scheduler else 1
num_micro_batches = 1 if pytorch_backend_config.disable_overlap_scheduler else 2

resources["seq_slot_manager"] = SeqSlotManager(
executor_config.max_batch_size * num_micro_batches)
@@ -450,8 +450,8 @@ def create_py_executor_instance(dist,
model_engine=model_engine,
decoder=decoder,
dist=dist,
enable_overlap_scheduler=pytorch_backend_config.
enable_overlap_scheduler,
disable_overlap_scheduler=pytorch_backend_config.
disable_overlap_scheduler,
max_batch_size=executor_config.max_batch_size,
max_draft_tokens=spec_config.max_draft_tokens
if spec_config is not None else 0,
@@ -471,9 +471,9 @@ def instantiate_decoder(model_engine, executor_config, pytorch_backend_config,
spec_config=model_engine.spec_config)
elif pytorch_backend_config.enable_trtllm_decoder:
decoding_mode = get_decoding_mode(executor_config)
decoder = TRTLLMDecoder(executor_config, model_engine.model,
model_engine.dtype, mapping, decoding_mode,
pytorch_backend_config.enable_overlap_scheduler)
decoder = TRTLLMDecoder(
executor_config, model_engine.model, model_engine.dtype, mapping,
decoding_mode, pytorch_backend_config.disable_overlap_scheduler)
elif not model_engine.model.model_config.is_generation:
# NOTE: choose decoder based on model type
decoder = EarlyStopDecoder()
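The micro-batch sizing above keeps its old behavior under the renamed flag. The sketch below restates the rule from that hunk as a standalone helper so the polarity flip is easy to check; `num_seq_slots` is a hypothetical name, not library code.

```python
# Standalone restatement of the seq-slot sizing shown above; `num_seq_slots`
# is a hypothetical helper, not part of TensorRT-LLM.
def num_seq_slots(max_batch_size: int, pp_size: int,
                  disable_overlap_scheduler: bool) -> int:
    if pp_size > 1:
        # With pipeline parallelism, one micro-batch per pipeline stage.
        num_micro_batches = pp_size
    else:
        # Overlap scheduling keeps two micro-batches in flight; disabling it keeps one.
        num_micro_batches = 1 if disable_overlap_scheduler else 2
    return max_batch_size * num_micro_batches

assert num_seq_slots(8, pp_size=1, disable_overlap_scheduler=False) == 16
assert num_seq_slots(8, pp_size=1, disable_overlap_scheduler=True) == 8
```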
2 changes: 1 addition & 1 deletion tensorrt_llm/_torch/pyexecutor/config.py
@@ -45,7 +45,7 @@ class PyTorchConfig:
# If true, batches are rounded up to the nearest cuda_graph_batch_size.
# This is usually a net win for performance.
cuda_graph_padding_enabled: bool = False
enable_overlap_scheduler: bool = False
disable_overlap_scheduler: bool = False
# If set, at most moe_max_num_tokens tokens will be sent to torch.ops.trtllm.fused_moe at the same time.
# If the number of tokens exceeds moe_max_num_tokens, the input tensors will be split into chunks and a for loop will be used.
moe_max_num_tokens: Optional[int] = None
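Because the new field defaults to `False`, the overlap scheduler is active unless a caller turns it off. A minimal sketch of the two configurations follows, assuming the `PyTorchConfig` import path used by the examples touched in this PR.

```python
# Sketch only; the import path is assumed from the examples in this PR.
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig

default_cfg = PyTorchConfig()                                # overlap scheduler enabled
opt_out_cfg = PyTorchConfig(disable_overlap_scheduler=True)  # overlap scheduler disabled

assert default_cfg.disable_overlap_scheduler is False
assert opt_out_cfg.disable_overlap_scheduler is True
```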
4 changes: 2 additions & 2 deletions tensorrt_llm/_torch/pyexecutor/decoder.py
@@ -449,7 +449,7 @@ def __init__(
model_dtype,
mapping: Mapping,
decoding_mode: DecodingMode,
enable_overlap_scheduler: bool,
disable_overlap_scheduler: bool,
):

vocab_size = model.config.vocab_size
@@ -468,7 +468,7 @@ def __init__(
self.max_num_sequences = mapping.pp_size * self.executor_config.max_batch_size
self.max_seq_idle_microseconds = 180 * 1000 * 1000
self.max_decoding_tokens = 1 # It must be 1 when not in speculative decoding
self.is_trt_overlap = enable_overlap_scheduler
self.is_trt_overlap = not disable_overlap_scheduler

self.world_config = WorldConfig.mpi(mapping.gpus_per_node,
mapping.tp_size, mapping.pp_size)
6 changes: 3 additions & 3 deletions tensorrt_llm/_torch/pyexecutor/model_engine.py
@@ -331,7 +331,7 @@ def __init__(
layerwise_nvtx_marker.register_hooks(self.model, module_prefix)

self.enable_attention_dp = self.model.model_config.mapping.enable_attention_dp
self._enable_overlap_scheduler = self.pytorch_backend_config.enable_overlap_scheduler
self._disable_overlap_scheduler = self.pytorch_backend_config.disable_overlap_scheduler
self._torch_compile_backend = None
self.dtype = self.model.config.torch_dtype
self._init_model_capacity()
@@ -982,7 +982,7 @@ def _preprocess_inputs(self, inputs: Dict[str, Any]):
"""
Make some changes to the device inputs and avoid block the async data transfer
"""
if self.is_spec_decode and self._enable_overlap_scheduler:
if self.is_spec_decode and not self._disable_overlap_scheduler:
# When enabling overlap scheduler, the kv cache for draft tokens will
# be prepared in advance by using the max_draft_len. But we need to use
# new_tokens_lens_device to get the real past kv lengths and the
@@ -1086,7 +1086,7 @@ def _prepare_tp_inputs(
dtype=torch.int32).to('cuda',
non_blocking=True))

if self._enable_overlap_scheduler and self.is_spec_decode:
if not self._disable_overlap_scheduler and self.is_spec_decode:
spec_dec_mode = self.spec_config.spec_dec_mode
assert spec_dec_mode.support_overlap_scheduler(
), f"{self.spec_config.spec_dec_name} does not support overlap scheduler"
8 changes: 4 additions & 4 deletions tensorrt_llm/_torch/pyexecutor/py_executor.py
@@ -162,7 +162,7 @@ def __init__(self,
model_engine: ModelEngine,
decoder: Decoder,
dist: Distributed,
enable_overlap_scheduler: bool = False,
disable_overlap_scheduler: bool = False,
max_input_len: int = 2048,
max_batch_size: int = 8,
max_draft_tokens: int = 0,
@@ -187,7 +187,7 @@ def __init__(self,
self.enable_attention_dp = model_engine.enable_attention_dp
self.decoder = decoder
self.dist = dist
self.enable_overlap_scheduler = enable_overlap_scheduler
self.disable_overlap_scheduler = disable_overlap_scheduler

# Draft model for certain spec decode algorithms, e.g. EAGLE3
self.draft_model_engine = draft_model_engine
@@ -258,7 +258,7 @@ def __init__(self,
if self.dist.pp_size > 1:
self.event_loop = self._executor_loop_pp
else:
self.event_loop = self._executor_loop_overlap if enable_overlap_scheduler else self._executor_loop
self.event_loop = self._executor_loop if disable_overlap_scheduler else self._executor_loop_overlap

if is_trace_enabled("TLLM_TRACE_EXECUTOR_LOOP"):
self.event_loop = trace_func(self.event_loop)
@@ -1975,7 +1975,7 @@ def _handle_responses(self):
# If request is in transmission, so we don't need to emit a response
# Also, for the first iteration with overlap, we should skip since first token has already been emitted by context server
if request.is_disagg_generation_transmission_in_progress or (
self.enable_overlap_scheduler
not self.disable_overlap_scheduler
and request.py_decoding_iter <= 1):
new_active_requests.append(request)
continue
2 changes: 1 addition & 1 deletion tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
@@ -106,7 +106,7 @@ def create_py_executor(executor_config: ExecutorConfig,
# PyTorchModelEngine modifies these fields, update them to executor_config
max_seq_len = model_engine.max_seq_len
origin_seq_len = max_seq_len
if pytorch_backend_config.enable_overlap_scheduler:
if not pytorch_backend_config.disable_overlap_scheduler:
max_seq_len = model_engine.max_seq_len + 1
if spec_config is not None:
max_seq_len += spec_config.max_draft_tokens
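The sequence-length headroom rule above is unchanged apart from the inverted check. The sketch below restates it with a hypothetical helper so the arithmetic is explicit; the name `adjusted_max_seq_len` is not library code.

```python
# Standalone restatement of the max_seq_len adjustment shown above;
# `adjusted_max_seq_len` is a hypothetical helper, not library code.
def adjusted_max_seq_len(max_seq_len: int, disable_overlap_scheduler: bool,
                         max_draft_tokens: int = 0) -> int:
    if not disable_overlap_scheduler:
        # The overlap scheduler reserves one extra token slot, plus room for
        # draft tokens when speculative decoding is configured.
        max_seq_len += 1
        max_seq_len += max_draft_tokens
    return max_seq_len

assert adjusted_max_seq_len(4096, disable_overlap_scheduler=False, max_draft_tokens=3) == 4100
assert adjusted_max_seq_len(4096, disable_overlap_scheduler=True, max_draft_tokens=3) == 4096
```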
1 change: 0 additions & 1 deletion tensorrt_llm/bench/benchmark/utils/general.py
@@ -148,7 +148,6 @@ def get_settings(params: dict, dataset_metadata: DatasetMetadata, model: str,
pyt_options = {
"use_cuda_graph": True,
"cuda_graph_padding_enabled": True,
"enable_overlap_scheduler": True,
"kv_cache_dtype": kv_cache_dtype,
"cuda_graph_max_batch_size": max_batch_size,
}
2 changes: 1 addition & 1 deletion tensorrt_llm/commands/eval.py
@@ -115,7 +115,7 @@ def main(ctx, model: str, tokenizer: Optional[str], log_level: str,
backend = None
pytorch_backend_config = None
if backend == "pytorch":
pytorch_backend_config = PyTorchConfig(enable_overlap_scheduler=True)
pytorch_backend_config = PyTorchConfig()

llm_args = {
"model": model,
3 changes: 1 addition & 2 deletions tensorrt_llm/commands/serve.py
@@ -50,8 +50,7 @@ def get_llm_args(model: str,
kv_cache_config = KvCacheConfig(
free_gpu_memory_fraction=free_gpu_memory_fraction)

pytorch_backend_config = PyTorchConfig(
enable_overlap_scheduler=True) if backend == "pytorch" else None
pytorch_backend_config = PyTorchConfig() if backend == "pytorch" else None
dynamic_batch_config = DynamicBatchConfig(
enable_batch_size_tuning=True,
enable_max_num_tokens_tuning=False,
2 changes: 1 addition & 1 deletion tensorrt_llm/executor/worker.py
@@ -384,7 +384,7 @@ def _enqueue_request(self, request: GenerationRequest) -> int:
context_phase_params = request.disaggregated_params.get_context_phase_params(
)

is_overlap_enabled = self._is_pytorch_backend and self._executor_config.pytorch_backend_config.enable_overlap_scheduler
is_overlap_enabled = self._is_pytorch_backend and not self._executor_config.pytorch_backend_config.disable_overlap_scheduler
if is_overlap_enabled:
is_disaggregated = self.engine.kv_cache_transceiver is not None
if is_disaggregated and (
4 changes: 2 additions & 2 deletions tensorrt_llm/scaffolding/worker.py
@@ -136,11 +136,11 @@ def init_with_new_llm(
max_batch_size: int = 32,
max_num_tokens: int = 4096,
kv_cache_free_gpu_memory_fraction: float = 0.9,
enable_overlap_scheduler: bool = True,
disable_overlap_scheduler: bool = False,
):
pytorch_backend_config = PyTorchConfig(
mixed_decoder=True,
enable_overlap_scheduler=enable_overlap_scheduler,
disable_overlap_scheduler=disable_overlap_scheduler,
)
kv_cache_config = KvCacheConfig(
free_gpu_memory_fraction=kv_cache_free_gpu_memory_fraction, )