[#7136][feat] trtllm-serve + autodeploy integration #7141
Conversation
Signed-off-by: Suyog Gupta <[email protected]>
📝 Walkthrough

Adds an AutoDeploy backend branch to the serve CLI runtime path and a pre-validator for `max_seq_len` in the AutoDeploy LLM args.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant CLI as trtllm-serve CLI
participant Serve as serve.py
participant AutoDeploy as AutoDeployLLM
participant PyTorch as PyTorch LLM
participant Server as HTTP Server
User->>CLI: trtllm-serve --backend _autodeploy ...
CLI->>Serve: parse args
Serve->>Serve: get_llm_args()
alt backend == "_autodeploy"
Serve->>Serve: llm_args.pop("build_config", None)
Serve->>Serve: llm_args["kv_cache_config"].enable_block_reuse = False
Serve->>AutoDeploy: AutoDeployLLM(**llm_args)
AutoDeploy-->>Serve: AutoDeploy LLM instance
else backend == "pytorch"
Serve->>PyTorch: instantiate PyTorch LLM (existing flow)
PyTorch-->>Serve: PyTorch LLM instance
end
Serve->>Server: start HTTP server with LLM
User->>Server: /v1/chat/completions
    Server-->>User: response
```
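In code form, the dispatch shown in the diagram reduces to a small branch. The sketch below is an illustration only (not the merged serve.py); `pytorch_llm_cls`, `trt_llm_cls`, and `autodeploy_llm_cls` are hypothetical stand-ins for the PyTorchLLM, LLM, and AutoDeployLLM classes discussed in the review.

```python
# Illustrative sketch of the backend dispatch above; not the actual serve.py code.
# The class parameters are stand-ins for PyTorchLLM, LLM (TensorRT engine path),
# and AutoDeployLLM from the PR discussion.
def build_llm(backend, llm_args, pytorch_llm_cls, trt_llm_cls, autodeploy_llm_cls):
    if backend == "pytorch":
        return pytorch_llm_cls(**llm_args)
    if backend == "_autodeploy":
        # AutoDeploy has no build_config and does not support KV-cache block reuse yet.
        llm_args.pop("build_config", None)
        llm_args["kv_cache_config"].enable_block_reuse = False
        return autodeploy_llm_cls(**llm_args)
    # Any other value falls through to the TensorRT engine path.
    return trt_llm_cls(**llm_args)
```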
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested reviewers
Signed-off-by: Suyog Gupta <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
277-286: Max-seq-len validator: confirm intent vs. CLI docs; simplify default lookup

The fallback to `AutoDeployConfig`'s default when `max_seq_len` is None may contradict the serve CLI help ("If unspecified, deduced from the model config"). If the desired behavior for AutoDeploy is "use model config if unspecified," consider deferring resolution to the factory/model, or document that AutoDeploy uses a concrete default (currently 512).

Two actionable tweaks:

- Use `cls.model_fields` to fetch the default, which reads better with inheritance (a generic pydantic sketch of this pattern follows the serve.py nitpicks below).
- Optionally guard against accidental string inputs (pydantic will coerce, but explicitness helps).

Minimal diff:

```diff
-    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
+    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
         if value is None:
-            # Fallback to the AutoDeployConfig default when not provided
-            return AutoDeployConfig.model_fields["max_seq_len"].get_default(
+            # Fallback to the class default when not provided
+            return cls.model_fields["max_seq_len"].get_default(
                 call_default_factory=True
             )
         return value
```

If the intent is "prefer model-config value when available," I can propose a small change to resolve `max_seq_len` during `create_factory()` by querying the loaded model config when the field is None. Do you want that?

tensorrt_llm/commands/serve.py (3)
113-115: Normalize backend consistently and keep 'trt' explicit to match CLI choices

You convert unknown backends to `None`, which works with the fallback path, but the CLI accepts "trt". Keep "trt" explicit to reduce ambiguity and aid logging/telemetry.

Apply this diff:

```diff
-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    backend = backend if backend in ["pytorch", "trt", "_autodeploy"] else "trt"
```

This preserves the TensorRT path when users pass "trt" (or when an unknown value is provided, we default to "trt" rather than `None`). If you prefer strictness, raise on unknown instead.

Also applies to: 144-145
168-172: Avoid mutating args in place; replace print with logger; make build_config removal safe

- Use `logger` instead of a bare `print` for consistency.
- `del llm_args["build_config"]` can KeyError if callers pre-strip it; prefer `pop(..., None)`.
- Consider logging only the backend selection (args may contain large objects or sensitive paths).

Apply this diff:

```diff
-    elif backend == '_autodeploy':
-        print(f"Using AutoDeploy backend with args: {llm_args}")
-        del llm_args["build_config"]
-        llm = AutoDeployLLM(**llm_args)
+    elif backend == '_autodeploy':
+        logger.info("Using AutoDeploy backend")
+        llm_args.pop("build_config", None)
+        llm = AutoDeployLLM(**llm_args)
```
212-215: Update help text to include 'trt' behavior for clarity

The choices include "trt" but the help text doesn't document it.

Apply this diff:

```diff
-@click.option("--backend",
-              type=click.Choice(["pytorch", "trt", "_autodeploy"]),
-              default="pytorch",
-              help="Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path.")
+@click.option("--backend",
+              type=click.Choice(["pytorch", "trt", "_autodeploy"]),
+              default="pytorch",
+              help="Backend to use: 'pytorch' for PyTorch, 'trt' for TensorRT-LLM engine, or '_autodeploy' for AutoDeploy. Default: 'pytorch'.")
```
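Returning to the `max_seq_len` validator comment above, here is the suggested `cls.model_fields` fallback pattern in isolation. This is a generic pydantic v2 sketch; `MyConfig` and its default of 512 are illustrative, not the repo's classes.

```python
# Generic pydantic v2 sketch: a "before" validator that falls back to the class
# default via cls.model_fields, so subclasses that override the field keep their own default.
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class MyConfig(BaseModel):
    max_seq_len: Optional[int] = 512  # illustrative default

    @field_validator("max_seq_len", mode="before")
    @classmethod
    def ensure_max_seq_len(cls, value: Any) -> Any:
        if value is None:
            # Look up the default on the class so inheritance is respected.
            return cls.model_fields["max_seq_len"].get_default(
                call_default_factory=True)
        return value


print(MyConfig(max_seq_len=None).max_seq_len)  # -> 512
```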
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (0 hunks)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1 hunks)
- tensorrt_llm/commands/serve.py (5 hunks)
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/commands/serve.py
tensorrt_llm/_torch/auto_deploy/llm_args.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/commands/serve.py
tensorrt_llm/_torch/auto_deploy/llm_args.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/commands/serve.py (1)
15-15: Import of AutoDeployLLM looks fine

Alias is clear and avoids collisions with other LLM classes. No action needed.
Signed-off-by: Suyog Gupta <[email protected]>
cc: @Fridah-nv for doc changes
LGTM.
Signed-off-by: Suyog Gupta <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (6)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (6)
15-17: Tighten wording for bullets under the launch command

Minor style/clarity improvements and consistent terminology.

```diff
-- `model`: HF name or local path
-- `--backend _autodeploy`: uses AutoDeploy runtime
+- `model`: Hugging Face model name or a local path
+- `--backend _autodeploy`: uses the AutoDeploy runtime
```
18-18: Add a note about differing ports

Readers may run the server on a non-default port; call this out next to the curl example.

```diff
 Once the server is ready, test with an OpenAI-compatible request:
+If you run the server on a different port (for example via `--port`), adjust the URL accordingly.
```
21-29: Make curl example fail fast on HTTP errors for easier debugging

Adding `-Ssf` surfaces server-side errors immediately, improving UX when the endpoint or payload is misconfigured.

```diff
-curl -s http://localhost:8000/v1/chat/completions \
+curl -Ssf http://localhost:8000/v1/chat/completions \
```
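For reference, the same request can be issued from Python with an equivalent fail-fast check. This is a minimal sketch assuming the quick-start model name and the default `localhost:8000` endpoint; the payload simply follows the OpenAI chat-completions schema.

```python
# Minimal Python equivalent of the curl example; raise_for_status() mirrors curl -Ssf.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello from AutoDeploy!"}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()  # fail fast on HTTP errors
print(resp.json()["choices"][0]["message"]["content"])
```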
44-62: Document optional max_seq_len override to match backend behavior

Per the backend changes, `max_seq_len` defaults are ensured during validation. Exposing an optional override here will help users understand how to tune it.

```diff
 # Compilation backend for AutoDeploy
 compile_backend: torch-opt # options: torch-simple, torch-compile, torch-cudagraph, torch-opt
+
+# Sequence length (optional)
+# If omitted, AutoDeploy uses its configured default.
+# Uncomment to override:
+# max_seq_len: 4096
```
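If it helps readers, the option names above can also be generated programmatically. The sketch below writes such a config file from Python; the concrete values and the `--extra_llm_api_options` flag in the trailing comment are assumptions for illustration, not verified against the merged docs.

```python
# Sketch: build the AutoDeploy YAML discussed above from Python (values are examples).
import yaml

autodeploy_config = {
    "compile_backend": "torch-opt",   # torch-simple | torch-compile | torch-cudagraph | torch-opt
    "attn_backend": "flashinfer",
    "free_mem_ratio": 0.8,
    "cuda_graph_batch_sizes": [1, 8, 32],  # assumed example sizes
    "max_seq_len": 4096,              # optional override; omit to use the AutoDeploy default
}

with open("autodeploy_config.yaml", "w") as f:
    yaml.safe_dump(autodeploy_config, f)

# Assumed usage (flag name is an assumption):
# trtllm-serve meta-llama/Llama-3.1-8B-Instruct --backend _autodeploy \
#     --extra_llm_api_options autodeploy_config.yaml
```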
66-68: Polish "Limitations" phrasing (docs tone + clarity)

Use formal tone and expand "WIP".

```diff
-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
+- KV-cache block reuse is automatically disabled when using the AutoDeploy backend.
+- The AutoDeploy backend does not yet support disaggregated serving (work in progress).
```
61-61: Optional: Mention dependency note for flashinfer

If `attn_backend: flashinfer` requires an extra install, a brief note/link would preempt setup issues.

I can add a one-liner with the install command or link if you confirm the exact dependency path/version you want to recommend.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
- docs/source/torch/auto_deploy/auto-deploy.md (1 hunks)
- tensorrt_llm/commands/serve.py (5 hunks)
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/commands/serve.py
🧰 Additional context used
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md
[grammar] ~15-~15: There might be a mistake here.
Context: ... \ ``` - model
: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt
- ...
(QB_NEW_EN)
🪛 GitHub Actions: Release Checks
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md
[error] 1-1: end-of-file-fixer: fixed trailing newline at end of file.
🔇 Additional comments (1)
docs/source/torch/auto_deploy/auto-deploy.md (1)
62-62: Nav link addition looks good; verify path resolution in built docs

The relative link target appears correct. Please confirm the sidebar renders this entry under Advanced Usage after the docs build.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/commands/serve.py (1)
332-351: Wire `moe_cluster_parallel_size` through `get_llm_args` in `serve.py`

The CLI flag `--cluster_size` is passed into `serve()` as `moe_cluster_parallel_size=cluster_size`, but the `get_llm_args` function in `tensorrt_llm/commands/serve.py` doesn't accept or propagate that parameter, so it's silently dropped.

• File: `tensorrt_llm/commands/serve.py`
  – At the `get_llm_args` signature (around line 74), add the new parameter.
  – Inside the returned `llm_args` dict, include the key/value pair.

Proposed patch:

```diff
 def get_llm_args(model: str,
                  tokenizer: Optional[str] = None,
                  backend: str = "pytorch",
                  max_beam_width: int = BuildConfig.max_beam_width,
                  max_batch_size: int = BuildConfig.max_batch_size,
                  max_num_tokens: int = BuildConfig.max_num_tokens,
                  max_seq_len: int = BuildConfig.max_seq_len,
                  tensor_parallel_size: int = 1,
                  pipeline_parallel_size: int = 1,
                  moe_expert_parallel_size: Optional[int] = None,
+                 moe_cluster_parallel_size: Optional[int] = None,
                  gpus_per_node: Optional[int] = None,
                  free_gpu_memory_fraction: Optional[float] = None,
                  mamba_ssm_cache_dtype: str = "auto",
                  num_postprocess_workers: int = 0,
                  trust_remote_code: bool = False,
                  reasoning_parser: Optional[str] = None,
                  fail_fast_on_attention_window_too_large: bool = False,
                  **llm_args_extra_dict: Any):
     …
     return {
         "pipeline_parallel_size": pipeline_parallel_size,
         "moe_expert_parallel_size": moe_expert_parallel_size,
+        "moe_cluster_parallel_size": moe_cluster_parallel_size,
         "gpus_per_node": gpus_per_node,
         "free_gpu_memory_fraction": free_gpu_memory_fraction,
         …
     }
```
♻️ Duplicate comments (1)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)
10-13: Fix trailing backslash in the Quick start command (copy/paste breaks).

The last line ends with a continuation backslash but nothing follows, which will cause a shell error on copy/paste.

Apply this diff:

```diff
 trtllm-serve \
   meta-llama/Llama-3.1-8B-Instruct \
-  --backend _autodeploy \
+  --backend _autodeploy
```
🧹 Nitpick comments (5)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)
66-73: Tighten wording in Limitations and tips for clarity and tone.

Minor grammar/style nits. Improves readability and aligns with formal doc tone.

```diff
-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
+- KV cache block reuse is disabled automatically for the AutoDeploy backend.
+- The AutoDeploy backend doesn't yet support disaggregated serving (WIP).
 - For best performance:
   - Prefer `compile_backend: torch-opt`
   - Use `attn_backend: flashinfer`
   - Set realistic `cuda_graph_batch_sizes` that match expected traffic
   - Tune `free_mem_ratio` to 0.8–0.9
```

tensorrt_llm/commands/serve.py (4)
113-113: Backend normalization drops 'trt' explicitly; prefer explicit mapping for clarity.

Currently, any non-['pytorch','_autodeploy'] backend (including 'trt') becomes None and falls into the TRT path implicitly. This works but is confusing given the CLI exposes 'trt'.

```diff
-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    # Normalize backend: keep explicit 'trt' instead of collapsing to None
+    backend = backend if backend in ["pytorch", "trt", "_autodeploy"] else None
```

Optionally, make the TRT branch explicit in launch_server (see separate comment).
215-220: Backend CLI help omits 'trt'; align help text with choices.

The Click choices include 'trt' but the help text only mentions 'pytorch' and '_autodeploy'.

```diff
-    "Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path."
+    "Set to 'pytorch' for the PyTorch path, 'trt' for the TensorRT engine path, or '_autodeploy' for the AutoDeploy path. Default is 'pytorch'."
```
164-176: Make TRT path explicit in launch_server for readability.

Optional: treat 'trt' explicitly rather than piggybacking on the fallback branch. This matches the CLI surface and future-proofs validation.

```diff
-    if backend == 'pytorch':
+    if backend == 'pytorch':
         llm = PyTorchLLM(**llm_args)
-    elif backend == '_autodeploy':
+    elif backend == '_autodeploy':
         ...
-    else:
+    elif backend == 'trt' or backend is None:
         llm = LLM(**llm_args)
+    else:
+        raise ValueError(f"Unsupported backend: {backend}")
```
37-71: Signal handler comment vs implementation mismatch; clarify and keep it minimal.

Comment says "Using print for safety" but code uses logger calls inside the handler. In Python, handlers run on the main thread, but keeping them minimal is still advisable. Either switch to `print` or update the comment to match the implementation, for example:

```diff
-    """Signal handler to clean up the child process."""
-    global _child_p_global
-    if _child_p_global and _child_p_global.poll() is None:
-        # Using print for safety in signal handlers
-        logger.info(
+    """Signal handler to clean up the child process (minimal work)."""
+    global _child_p_global
+    if _child_p_global and _child_p_global.poll() is None:
+        # Using logger here is acceptable in Python signal handlers (main thread), but keep work minimal.
+        logger.info(
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
- tensorrt_llm/commands/serve.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/commands/serve.py
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md
[grammar] ~15-~15: There might be a mistake here.
Context: ... \ ``` - model
: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt
- ...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)
74-78: Cross-link targets verified

The referenced "See also" links point to existing files in the docs source:
- docs/source/torch/auto_deploy/auto-deploy.md
- docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md
No further action required.
tensorrt_llm/commands/serve.py (2)
17-17: Importing AutoDeployLLM is correct and scoped.

Alias-based import reads cleanly and avoids name collision with the TRT LLM.
672-695: Signal handlers are registered and restored correctly.

Good defensive pattern: save originals, register temporary handlers, restore in finally.
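As a standalone illustration of that pattern (save the originals, install temporary handlers, restore them in `finally`), here is a generic sketch that is independent of serve.py's actual cleanup logic:

```python
# Generic save/register/restore pattern for signal handlers; not serve.py's code.
import signal


def run_with_temporary_handlers(work):
    def _handler(signum, frame):
        # Keep the handler minimal; do real cleanup outside the handler.
        raise KeyboardInterrupt(f"received signal {signum}")

    originals = {
        signal.SIGINT: signal.getsignal(signal.SIGINT),
        signal.SIGTERM: signal.getsignal(signal.SIGTERM),
    }
    for sig in originals:
        signal.signal(sig, _handler)
    try:
        return work()
    finally:
        # Restore the original handlers even if work() raised.
        for sig, original in originals.items():
            signal.signal(sig, original)


if __name__ == "__main__":
    print(run_with_temporary_handlers(lambda: "done"))
```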
Signed-off-by: Suyog Gupta <[email protected]>
/bot run
PR_Github #16086 [ run ] triggered by Bot
PR_Github #16086 [ run ] completed with state
/bot run
PR_Github #16129 [ run ] triggered by Bot
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)
343-344: TypeError: `get_llm_args()` doesn't accept `moe_cluster_parallel_size` but CLI passes it.

This will throw "unexpected keyword argument" and likely explains the failed pipeline. Add the parameter and forward it in the returned dict.
Apply this diff to the function signature and args dict:
```diff
 def get_llm_args(model: str,
@@
-                 moe_expert_parallel_size: Optional[int] = None,
+                 moe_expert_parallel_size: Optional[int] = None,
+                 moe_cluster_parallel_size: Optional[int] = None,
@@
     llm_args = {
@@
         "moe_expert_parallel_size": moe_expert_parallel_size,
+        "moe_cluster_parallel_size":
+            moe_cluster_parallel_size,
```

No change needed at the call site after this; the current invocation will become valid.
1-1
: Include NVIDIA SPDX header in `serve.py`
Please prepend the following line at the very top of
tensorrt_llm/commands/serve.py
, before any imports:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 import asyncio
```
This matches the style used across the repo for Python source files (e.g.
math_utils.py
,functional.py
) and updates the year range through 2025.
♻️ Duplicate comments (2)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)
9-13: Trailing backslash fix looks good — copy/paste works now.

The command now terminates cleanly and won't error in shells.
tensorrt_llm/commands/serve.py (1)
168-175: Autodeploy: potential AttributeError on kv_cache_config when YAML merges.

If extras set `kv_cache_config` to a dict or None, `enable_block_reuse` attribute assignment will fail. Normalize first.

Apply this diff:

```diff
-    # TODO(https://github.com/NVIDIA/TensorRT-LLM/issues/7142):
-    # AutoDeploy does not support cache reuse yet.
-    llm_args["kv_cache_config"].enable_block_reuse = False
+    # AutoDeploy does not support cache reuse yet.
+    kv_cfg = llm_args.get("kv_cache_config")
+    if isinstance(kv_cfg, dict):
+        kv_cfg = KvCacheConfig(**kv_cfg)
+    if kv_cfg is None:
+        kv_cfg = KvCacheConfig()
+    kv_cfg.enable_block_reuse = False
+    llm_args["kv_cache_config"] = kv_cfg
```
🧹 Nitpick comments (6)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (2)
15-16: Spell out "HF" on first use.

Minor clarity nit. Consider "Hugging Face (HF) name or local path."

Apply this diff:

```diff
-- `model`: HF name or local path
+- `model`: Hugging Face (HF) name or local path
```
66-73: Polish the "Limitations and tips" phrasing for consistency.

Tighten grammar, remove "WIP" from user-facing docs, and end list items with periods.

Apply this diff:

```diff
-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
-- For best performance:
-  - Prefer `compile_backend: torch-opt`
-  - Use `attn_backend: flashinfer`
-  - Set realistic `cuda_graph_batch_sizes` that match expected traffic
-  - Tune `free_mem_ratio` to 0.8–0.9
+- KV cache block reuse is disabled automatically for the AutoDeploy backend.
+- The AutoDeploy backend doesn't yet support disaggregated serving.
+- For best performance:
+  - Prefer `compile_backend: torch-opt`.
+  - Use `attn_backend: flashinfer`.
+  - Set realistic `cuda_graph_batch_sizes` that match expected traffic.
+  - Tune `free_mem_ratio` to 0.8–0.9.
```

tensorrt_llm/commands/serve.py (4)
17-17: Import AutoDeploy LLM lazily to avoid hard dependency at module import time.

Keeps the PyTorch/TRT paths usable even if the AutoDeploy stack isn't available.

Apply this diff:

```diff
-from tensorrt_llm._torch.auto_deploy.llm import LLM as AutoDeployLLM
@@
-    elif backend == '_autodeploy':
+    elif backend == '_autodeploy':
+        # Import lazily to avoid hard dependency when not used.
+        from tensorrt_llm._torch.auto_deploy.llm import LLM as AutoDeployLLM
         # AutoDeploy does not support build_config
         llm_args.pop("build_config", None)
-        # TODO(https://github.com/NVIDIA/TensorRT-LLM/issues/7142):
-        # AutoDeploy does not support cache reuse yet.
-        llm_args["kv_cache_config"].enable_block_reuse = False
+        # AutoDeploy does not support cache reuse yet.
+        # Be robust to YAML merges that may set kv_cache_config to dict/None.
+        kv_cfg = llm_args.get("kv_cache_config")
+        if isinstance(kv_cfg, dict):
+            kv_cfg = KvCacheConfig(**kv_cfg)
+        if kv_cfg is None:
+            kv_cfg = KvCacheConfig()
+        kv_cfg.enable_block_reuse = False
+        llm_args["kv_cache_config"] = kv_cfg
         llm = AutoDeployLLM(**llm_args)
```

Also applies to: 168-175
113-114: Preserve explicit 'trt' backend instead of coercing to None.

Keeping the string improves clarity in logs/config echo while still falling through to the TRT path.

Apply this diff:

```diff
-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    backend = backend if backend in ["pytorch", "_autodeploy", "trt"] else None
```
214-220
: Help text: mention the TRT path explicitly for completeness.Make it clear what “trt” does.
Apply this diff:
- "Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path." + "Set to 'pytorch' for the PyTorch path, 'trt' for the TensorRT engine path, and '_autodeploy' for the AutoDeploy path. Default is PyTorch."
265-268
: Align `--kv_cache_free_gpu_memory_fraction`
Defaults Across CLI and Documentation

The CLI currently defaults
--kv_cache_free_gpu_memory_fraction
to 0.9 (serve.py:265–266; eval.py:85–87), but several documentation examples use 0.8 as the default fraction:
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (lines 117–118) shows
--kv_cache_free_gpu_memory_fraction 0.8
- docs/source/torch/auto_deploy/advanced/workflow.md (line 22) and serving/benchmarking guides set
free_mem_ratio: 0.8
- docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (lines 55, 72) recommend tuning to 0.8–0.9 but use 0.8 in examples
At the same time, most quick-start recipes (llama3.3, llama4-scout) and tech blogs use 0.9, and scaffolding code defaults to 0.9 (worker.py:151).
To ensure a single source of truth, please choose one of the following:
Option A: Update Documentation to 0.9
– Change all examples that currently use 0.8 to 0.9
– In tuning guides, note that while the default is 0.9, users may adjust within the 0.8–0.9 range as neededOption B: Change CLI Default to 0.8
– Modify the default in serve.py and eval.py from 0.9 to 0.8
– Update scaffolding/worker.py’s default to 0.8 for consistency
– Clarify in docs that the CLI default is 0.8, with 0.9 still acceptable for higher GPU headroomPlease implement one of these approaches so that code defaults and documentation examples are in sync.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
- tensorrt_llm/commands/serve.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/commands/serve.py
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md
[grammar] ~15-~15: There might be a mistake here.
Context: ...oy ``` - model
: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...
(QB_NEW_EN)
[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt
- ...
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)
74-77: Add cross-link to the Compile Backends section in the Support Matrix

Paths and anchors have been verified in `docs/source/torch/auto_deploy/support_matrix.md` (heading "Compile Backends" → anchor `#compile-backends`) and the relative path from this file is correct. Please apply the following update:

```diff
 ## See also
 - [AutoDeploy overview](../auto-deploy.md)
 - [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)
+- [Compilation backends support matrix](../support_matrix.md#compile-backends)
```
PR_Github #16129 [ run ] completed with state
LGTM for doc and trtllm-serve part changes.
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.