
Conversation

suyoggupta
Collaborator

@suyoggupta suyoggupta commented Aug 21, 2025

Summary by CodeRabbit

  • New Features
    • Added an AutoDeploy serving backend to the CLI and runtime; --backend now accepts "_autodeploy" alongside "pytorch" and "trt". The AutoDeploy path is integrated at launch and automatically disables KV-cache block reuse.
  • Bug Fixes
    • Validator now ensures a concrete default for max sequence length when not specified.
  • Documentation
    • New "Serving with trtllm-serve" guide and navigation link added to AutoDeploy docs.
  • Tests
    • Clarifying comments about AutoDeploy KV-cache behavior added to integration tests.
  • Chores
    • Minor formatting cleanup with no user-facing impact.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

Contributor

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

Walkthrough

Adds an AutoDeploy backend branch to the serve CLI runtime path, a pre-validator for max_seq_len in LlmArgs, a serving guide and nav link in docs, small test comment additions, and a formatting tweak in a custom ops file.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Serve CLI: AutoDeploy backend<br>tensorrt_llm/commands/serve.py | Added support for backend _autodeploy: import AutoDeployLLM, restrict allowed backends, remove build_config for autodeploy, disable KV cache block reuse (kv_cache_config.enable_block_reuse = False), and instantiate AutoDeployLLM(**llm_args). Updated CLI backend choices/help. |
| LlmArgs validation<br>tensorrt_llm/_torch/auto_deploy/llm_args.py | Added ensure_max_seq_len pre-validation (@field_validator("max_seq_len", mode="before")) to substitute the AutoDeployConfig default when max_seq_len is None. |
| Docs: Serving guide and nav<br>docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md, docs/source/torch/auto_deploy/auto-deploy.md | New "Serving with trtllm-serve" guide documenting --backend _autodeploy usage, config YAML examples, limitations and tips; added navigation link under Advanced Usage. |
| Custom ops formatting<br>tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py | Removed an extra blank line; no behavioral changes. |
| Tests: comments<br>tests/integration/defs/accuracy/test_llm_api_autodeploy.py | Inserted explanatory comments noting AutoDeploy does not support cache reuse yet; no functional changes. |
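
To make the walkthrough concrete, here is a minimal sketch of the `_autodeploy` branch described above. It is a reconstruction from the summary, not the exact merged code; `build_llm` is a hypothetical helper name used only for illustration.

```python
def build_llm(backend: str, llm_args: dict):
    """Illustrative dispatch for the AutoDeploy branch summarized above."""
    if backend == "_autodeploy":
        # Import alias as used in the PR's serve.py.
        from tensorrt_llm._torch.auto_deploy.llm import LLM as AutoDeployLLM

        # AutoDeploy has no TensorRT build step, so build_config is dropped.
        llm_args.pop("build_config", None)
        # AutoDeploy does not support KV-cache block reuse yet (issue #7142).
        llm_args["kv_cache_config"].enable_block_reuse = False
        return AutoDeployLLM(**llm_args)
    raise ValueError(f"This sketch only covers the _autodeploy branch, got: {backend!r}")
```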

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant CLI as trtllm-serve CLI
  participant Serve as serve.py
  participant AutoDeploy as AutoDeployLLM
  participant PyTorch as PyTorch LLM
  participant Server as HTTP Server

  User->>CLI: trtllm-serve --backend _autodeploy ...
  CLI->>Serve: parse args
  Serve->>Serve: get_llm_args()
  alt backend == "_autodeploy"
    Serve->>Serve: llm_args.pop("build_config", None)
    Serve->>Serve: llm_args["kv_cache_config"].enable_block_reuse = False
    Serve->>AutoDeploy: AutoDeployLLM(**llm_args)
    AutoDeploy-->>Serve: AutoDeploy LLM instance
  else backend == "pytorch"
    Serve->>PyTorch: instantiate PyTorch LLM (existing flow)
    PyTorch-->>Serve: PyTorch LLM instance
  end
  Serve->>Server: start HTTP server with LLM
  User->>Server: /v1/chat/completions
  Server-->>User: response
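
The flow above can be exercised end to end once the server is up. A minimal client-side sketch, assuming the `openai` Python package (v1+) is installed, the server uses the default port 8000, and the Llama model name from the docs example:

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running trtllm-serve instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # model name from the docs example
    messages=[{"role": "user", "content": "Hello from AutoDeploy"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```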

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • chzblych
  • kaiyux
  • Fridah-nv

@suyoggupta suyoggupta marked this pull request as draft August 21, 2025 21:47
Signed-off-by: Suyog Gupta <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

277-286: Max-seq-len validator: confirm intent vs. CLI docs; simplify default lookup

The fallback to AutoDeployConfig’s default when max_seq_len is None may contradict the serve CLI help (“If unspecified, deduced from the model config”). If the desired behavior for AutoDeploy is “use model config if unspecified,” consider deferring resolution to the factory/model, or document that AutoDeploy uses a concrete default (currently 512).

Two actionable tweaks:

  • Use cls.model_fields to fetch the default, which reads better with inheritance.
  • Optionally guard against accidental string inputs (pydantic will coerce, but explicitness helps).

Minimal diff:

-    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
+    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
         if value is None:
-            # Fallback to the AutoDeployConfig default when not provided
-            return AutoDeployConfig.model_fields["max_seq_len"].get_default(
+            # Fallback to the class default when not provided
+            return cls.model_fields["max_seq_len"].get_default(
                 call_default_factory=True
             )
         return value

If the intent is “prefer model-config value when available,” I can propose a small change to resolve max_seq_len during create_factory() by querying the loaded model config when the field is None. Do you want that?
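
For reference, a self-contained Pydantic v2 sketch of the validator shape discussed here, using the suggested `cls.model_fields` lookup. `AutoDeployConfig` below is a simplified stand-in for the real class, and 512 mirrors the default cited above:

```python
from typing import Any, Optional

from pydantic import BaseModel, ValidationInfo, field_validator


class AutoDeployConfig(BaseModel):
    """Simplified stand-in for the real config; 512 mirrors the default mentioned above."""
    max_seq_len: Optional[int] = 512


class LlmArgs(AutoDeployConfig):
    @field_validator("max_seq_len", mode="before")
    @classmethod
    def ensure_max_seq_len(cls, value: Any, info: ValidationInfo) -> Any:
        if value is None:
            # Fall back to the class default, as suggested in the review.
            return cls.model_fields["max_seq_len"].get_default(call_default_factory=True)
        return value


print(LlmArgs(max_seq_len=None).max_seq_len)  # -> 512
```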

tensorrt_llm/commands/serve.py (3)

113-115: Normalize backend consistently and keep ‘trt’ explicit to match CLI choices

You convert unknown backends to None, which works with the fallback path, but the CLI accepts "trt". Keep “trt” explicit to reduce ambiguity and aid logging/telemetry.

Apply this diff:

-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    backend = backend if backend in ["pytorch", "trt", "_autodeploy"] else "trt"

This preserves the TensorRT path when users pass “trt” (or when an unknown value is provided, we default to “trt” rather than None). If you prefer strictness, raise on unknown instead.

Also applies to: 144-145
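
If strictness is preferred over the silent fallback, a small illustrative sketch (function and constant names are hypothetical, not part of the PR):

```python
ALLOWED_BACKENDS = ("pytorch", "trt", "_autodeploy")


def normalize_backend(backend: str) -> str:
    """Strict variant of the normalization discussed above: reject unknown values."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"Unknown backend {backend!r}; expected one of {ALLOWED_BACKENDS}")
    return backend
```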


168-172: Avoid mutating args in place; replace print with logger; make build_config removal safe

  • Bare print is noisy; use logger for consistency.
  • del llm_args["build_config"] can KeyError if callers pre-strip it; prefer pop(..., None).
  • Consider logging only the backend selection (args may contain large objects or sensitive paths).

Apply this diff:

-    elif backend == '_autodeploy':
-        print(f"Using AutoDeploy backend with args: {llm_args}")
-        del llm_args["build_config"]
-        llm = AutoDeployLLM(**llm_args)
+    elif backend == '_autodeploy':
+        logger.info("Using AutoDeploy backend")
+        llm_args.pop("build_config", None)
+        llm = AutoDeployLLM(**llm_args)

212-215: Update help text to include ‘trt’ behavior for clarity

The choices include “trt” but the help text doesn’t document it.

Apply this diff:

-@click.option("--backend",
-              type=click.Choice(["pytorch", "trt", "_autodeploy"]),
-              default="pytorch",
-              help="Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path.")
+@click.option("--backend",
+              type=click.Choice(["pytorch", "trt", "_autodeploy"]),
+              default="pytorch",
+              help="Backend to use: 'pytorch' for PyTorch, 'trt' for TensorRT-LLM engine, or '_autodeploy' for AutoDeploy. Default: 'pytorch'.")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f7c597e and e9a1f59.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/commands/serve.py (1)

15-15: Import of AutoDeployLLM looks fine

Alias is clear and avoids collisions with other LLM classes. No action needed.

@suyoggupta suyoggupta changed the title [AutoDeploy] trtllm-serve + autodeploy integration [#7136][AutoDeploy] trtllm-serve + autodeploy integration Aug 21, 2025
@suyoggupta suyoggupta marked this pull request as ready for review August 21, 2025 22:30
@suyoggupta suyoggupta requested a review from a team as a code owner August 21, 2025 22:50
@suyoggupta suyoggupta requested review from chzblych and kaiyux August 21, 2025 22:50
@suyoggupta
Collaborator Author

cc: @Fridah-nv for doc changes

Collaborator

@2ez4bz 2ez4bz left a comment


LGTM.

Signed-off-by: Suyog Gupta <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (6)

15-17: Tighten wording for bullets under the launch command

Minor style/clarity improvements and consistent terminology.

-- `model`: HF name or local path
-- `--backend _autodeploy`: uses AutoDeploy runtime
+- `model`: Hugging Face model name or a local path
+- `--backend _autodeploy`: uses the AutoDeploy runtime

18-18: Add a note about differing ports

Readers may run the server on a non-default port; call this out next to the curl example.

 Once the server is ready, test with an OpenAI-compatible request:
+If you run the server on a different port (for example via `--port`), adjust the URL accordingly.

21-29: Make curl example fail fast on HTTP errors for easier debugging

Adding -Ssf surfaces server-side errors immediately, improving UX when the endpoint or payload is misconfigured.

-curl -s http://localhost:8000/v1/chat/completions \
+curl -Ssf http://localhost:8000/v1/chat/completions \

44-62: Document optional max_seq_len override to match backend behavior

Per the backend changes, max_seq_len defaults are ensured during validation. Exposing an optional override here will help users understand how to tune it.

 # Compilation backend for AutoDeploy
 compile_backend: torch-opt  # options: torch-simple, torch-compile, torch-cudagraph, torch-opt
+
+# Sequence length (optional)
+# If omitted, AutoDeploy uses its configured default.
+# Uncomment to override:
+# max_seq_len: 4096

66-68: Polish “Limitations” phrasing (docs tone + clarity)

Use formal tone and expand “WIP”.

-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
+- KV‑cache block reuse is automatically disabled when using the AutoDeploy backend.
+- The AutoDeploy backend does not yet support disaggregated serving (work in progress).

61-61: Optional: Mention dependency note for flashinfer

If attn_backend: flashinfer requires an extra install, a brief note/link would preempt setup issues.

I can add a one-liner with the install command or link if you confirm the exact dependency path/version you want to recommend.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e9a1f59 and b3955bc.

📒 Files selected for processing (4)
  • docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
  • docs/source/torch/auto_deploy/auto-deploy.md (1 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/commands/serve.py
🧰 Additional context used
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

[grammar] ~15-~15: There might be a mistake here.
Context: ... \ ``` - model: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...

(QB_NEW_EN)


[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt - ...

(QB_NEW_EN)

🪛 GitHub Actions: Release Checks
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

[error] 1-1: end-of-file-fixer: fixed trailing newline at end of file.

🔇 Additional comments (1)
docs/source/torch/auto_deploy/auto-deploy.md (1)

62-62: Nav link addition looks good; verify path resolution in built docs

The relative link target appears correct. Please confirm the sidebar renders this entry under Advanced Usage after the docs build.

@suyoggupta suyoggupta changed the title [#7136][AutoDeploy] trtllm-serve + autodeploy integration [#7136][feat] trtllm-serve + autodeploy integration Aug 21, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/commands/serve.py (1)

332-351: Wire moe_cluster_parallel_size through get_llm_args in serve.py

The CLI flag --cluster_size is passed into serve() as moe_cluster_parallel_size=cluster_size, but the get_llm_args function in tensorrt_llm/commands/serve.py doesn’t accept or propagate that parameter, so it’s silently dropped.

• File: tensorrt_llm/commands/serve.py
– At the get_llm_args signature (around line 74), add the new parameter.
– Inside the returned llm_args dict, include the key/value pair.

Proposed patch:

 def get_llm_args(model: str,
                  tokenizer: Optional[str] = None,
                  backend: str = "pytorch",
                  max_beam_width: int = BuildConfig.max_beam_width,
                  max_batch_size: int = BuildConfig.max_batch_size,
                  max_num_tokens: int = BuildConfig.max_num_tokens,
                  max_seq_len: int = BuildConfig.max_seq_len,
                  tensor_parallel_size: int = 1,
                  pipeline_parallel_size: int = 1,
                  moe_expert_parallel_size: Optional[int] = None,
+                 moe_cluster_parallel_size: Optional[int] = None,
                  gpus_per_node: Optional[int] = None,
                  free_gpu_memory_fraction: Optional[float] = None,
                  mamba_ssm_cache_dtype: str = "auto",
                  num_postprocess_workers: int = 0,
                  trust_remote_code: bool = False,
                  reasoning_parser: Optional[str] = None,
                  fail_fast_on_attention_window_too_large: bool = False,
                  **llm_args_extra_dict: Any):
     …
     return {
         "pipeline_parallel_size": pipeline_parallel_size,
         "moe_expert_parallel_size": moe_expert_parallel_size,
+        "moe_cluster_parallel_size": moe_cluster_parallel_size,
         "gpus_per_node": gpus_per_node,
         "free_gpu_memory_fraction": free_gpu_memory_fraction,
         …
     }
♻️ Duplicate comments (1)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)

10-13: Fix trailing backslash in the Quick start command (copy/paste breaks).

The last line ends with a continuation backslash but nothing follows, which will cause a shell error on copy/paste.

Apply this diff:

 trtllm-serve \
   meta-llama/Llama-3.1-8B-Instruct \
-  --backend _autodeploy \
+  --backend _autodeploy
🧹 Nitpick comments (5)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)

66-73: Tighten wording in Limitations and tips for clarity and tone.

Minor grammar/style nits. Improves readability and aligns with formal doc tone.

-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
+- KV cache block reuse is disabled automatically for the AutoDeploy backend.
+- The AutoDeploy backend doesn't yet support disaggregated serving (WIP).
 - For best performance:
   - Prefer `compile_backend: torch-opt`
   - Use `attn_backend: flashinfer`
   - Set realistic `cuda_graph_batch_sizes` that match expected traffic
   - Tune `free_mem_ratio` to 0.8–0.9
tensorrt_llm/commands/serve.py (4)

113-113: Backend normalization drops 'trt' explicitly; prefer explicit mapping for clarity.

Currently, any non-['pytorch','_autodeploy'] backend (including 'trt') becomes None and falls into the TRT path implicitly. This works but is confusing given the CLI exposes 'trt'.

-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    # Normalize backend: keep explicit 'trt' instead of collapsing to None
+    backend = backend if backend in ["pytorch", "trt", "_autodeploy"] else None

Optionally, make the TRT branch explicit in launch_server (see separate comment).


215-220: Backend CLI help omits 'trt'; align help text with choices.

The Click choices include 'trt' but the help text only mentions 'pytorch' and '_autodeploy'.

-    "Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path."
+    "Set to 'pytorch' for the PyTorch path, 'trt' for the TensorRT engine path, or '_autodeploy' for the AutoDeploy path. Default is 'pytorch'."

164-176: Make TRT path explicit in launch_server for readability.

Optional: treat 'trt' explicitly rather than piggybacking on the fallback branch. This matches the CLI surface and future-proofs validation.

-    if backend == 'pytorch':
+    if backend == 'pytorch':
         llm = PyTorchLLM(**llm_args)
-    elif backend == '_autodeploy':
+    elif backend == '_autodeploy':
         ...
-    else:
+    elif backend == 'trt' or backend is None:
         llm = LLM(**llm_args)
+    else:
+        raise ValueError(f"Unsupported backend: {backend}")

37-71: Signal handler comment vs implementation mismatch; clarify and keep it minimal.

The comment says "Using print for safety," but the code uses logger calls inside the handler. In Python, signal handlers run on the main thread, but keeping them minimal is still advisable. Either switch to print or update the comment.

-    """Signal handler to clean up the child process."""
-    global _child_p_global
-    if _child_p_global and _child_p_global.poll() is None:
-        # Using print for safety in signal handlers
-        logger.info(
+    """Signal handler to clean up the child process (minimal work)."""
+    global _child_p_global
+    if _child_p_global and _child_p_global.poll() is None:
+        # Using logger here is acceptable in Python signal handlers (main thread), but keep work minimal.
+        logger.info(
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b3955bc and b41071f.

📒 Files selected for processing (2)
  • docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/commands/serve.py
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

[grammar] ~15-~15: There might be a mistake here.
Context: ... \ ``` - model: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...

(QB_NEW_EN)


[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt - ...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)

74-78: Cross-link targets verified

The referenced “See also” links point to existing files in the docs source:

  • docs/source/torch/auto_deploy/auto-deploy.md
  • docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md

No further action required.

tensorrt_llm/commands/serve.py (2)

17-17: Importing AutoDeployLLM is correct and scoped.

Alias-based import reads cleanly and avoids name collision with the TRT LLM.


672-695: Signal handlers are registered and restored correctly.

Good defensive pattern: save originals, register temporary handlers, restore in finally.

Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16086 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16086 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12097 completed with status: 'FAILURE'

@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16129 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)

343-344: TypeError: get_llm_args() doesn’t accept moe_cluster_parallel_size but CLI passes it.

This will throw “unexpected keyword argument” and likely explains the failed pipeline. Add the parameter and forward it in the returned dict.

Apply this diff to the function signature and args dict:

 def get_llm_args(model: str,
@@
-                 moe_expert_parallel_size: Optional[int] = None,
+                 moe_expert_parallel_size: Optional[int] = None,
+                 moe_cluster_parallel_size: Optional[int] = None,
@@
     llm_args = {
@@
         "moe_expert_parallel_size":
         moe_expert_parallel_size,
+        "moe_cluster_parallel_size":
+        moe_cluster_parallel_size,

No change needed at the call site after this; the current invocation will become valid.


1-1: Include NVIDIA SPDX Header in serve.py

Please prepend the following line at the very top of tensorrt_llm/commands/serve.py, before any imports:

+ # SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  import asyncio

This matches the style used across the repo for Python source files (e.g. math_utils.py, functional.py) and updates the year range through 2025.

♻️ Duplicate comments (2)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)

9-13: Trailing backslash fix looks good — copy/paste works now.

The command now terminates cleanly and won’t error in shells.

tensorrt_llm/commands/serve.py (1)

168-175: Autodeploy: potential AttributeError on kv_cache_config when YAML merges.

If extras set kv_cache_config to a dict or None, enable_block_reuse attribute assignment will fail. Normalize first.

Apply this diff:

-        # TODO(https://github.com/NVIDIA/TensorRT-LLM/issues/7142):
-        # AutoDeploy does not support cache reuse yet.
-        llm_args["kv_cache_config"].enable_block_reuse = False
+        # AutoDeploy does not support cache reuse yet.
+        kv_cfg = llm_args.get("kv_cache_config")
+        if isinstance(kv_cfg, dict):
+            kv_cfg = KvCacheConfig(**kv_cfg)
+        if kv_cfg is None:
+            kv_cfg = KvCacheConfig()
+        kv_cfg.enable_block_reuse = False
+        llm_args["kv_cache_config"] = kv_cfg
🧹 Nitpick comments (6)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (2)

15-16: Spell out “HF” on first use.

Minor clarity nit. Consider “Hugging Face (HF) name or local path.”

Apply this diff:

-- `model`: HF name or local path
+- `model`: Hugging Face (HF) name or local path

66-73: Polish the “Limitations and tips” phrasing for consistency.

Tighten grammar, remove “WIP” from user-facing docs, and end list items with periods.

Apply this diff:

-- KV cache block reuse is disabled automatically for AutoDeploy backend
-- AutoDeploy backend doesn't yet support disaggregated serving. WIP
- - For best performance:
-   - Prefer `compile_backend: torch-opt`
-   - Use `attn_backend: flashinfer`
-   - Set realistic `cuda_graph_batch_sizes` that match expected traffic
-   - Tune `free_mem_ratio` to 0.8–0.9
+- KV cache block reuse is disabled automatically for the AutoDeploy backend.
+- The AutoDeploy backend does not yet support disaggregated serving.
+- For best performance:
+  - Prefer `compile_backend: torch-opt`.
+  - Use `attn_backend: flashinfer`.
+  - Set realistic `cuda_graph_batch_sizes` that match expected traffic.
+  - Tune `free_mem_ratio` to 0.8–0.9.
tensorrt_llm/commands/serve.py (4)

17-17: Import AutoDeploy LLM lazily to avoid hard dependency at module import time.

Keeps the PyTorch/TRT paths usable even if the AutoDeploy stack isn’t available.

Apply this diff:

-from tensorrt_llm._torch.auto_deploy.llm import LLM as AutoDeployLLM
@@
-    elif backend == '_autodeploy':
+    elif backend == '_autodeploy':
+        # Import lazily to avoid hard dependency when not used.
+        from tensorrt_llm._torch.auto_deploy.llm import LLM as AutoDeployLLM
         # AutoDeploy does not support build_config
         llm_args.pop("build_config", None)
-        # TODO(https://github.com/NVIDIA/TensorRT-LLM/issues/7142):
-        # AutoDeploy does not support cache reuse yet.
-        llm_args["kv_cache_config"].enable_block_reuse = False
+        # AutoDeploy does not support cache reuse yet.
+        # Be robust to YAML merges that may set kv_cache_config to dict/None.
+        kv_cfg = llm_args.get("kv_cache_config")
+        if isinstance(kv_cfg, dict):
+            kv_cfg = KvCacheConfig(**kv_cfg)
+        if kv_cfg is None:
+            kv_cfg = KvCacheConfig()
+        kv_cfg.enable_block_reuse = False
+        llm_args["kv_cache_config"] = kv_cfg
         llm = AutoDeployLLM(**llm_args)

Also applies to: 168-175


113-114: Preserve explicit 'trt' backend instead of coercing to None.

Keeping the string improves clarity in logs/config echo while still falling through to the TRT path.

Apply this diff:

-    backend = backend if backend in ["pytorch", "_autodeploy"] else None
+    backend = backend if backend in ["pytorch", "_autodeploy", "trt"] else None

214-220: Help text: mention the TRT path explicitly for completeness.

Make it clear what “trt” does.

Apply this diff:

-    "Set to 'pytorch' for pytorch path and '_autodeploy' for autodeploy path. Default is pytorch path."
+    "Set to 'pytorch' for the PyTorch path, 'trt' for the TensorRT engine path, and '_autodeploy' for the AutoDeploy path. Default is PyTorch."

265-268: Align --kv_cache_free_gpu_memory_fraction Defaults Across CLI and Documentation

The CLI currently defaults --kv_cache_free_gpu_memory_fraction to 0.9 (serve.py:265–266; eval.py:85–87), but several documentation examples use 0.8 as the default fraction:

  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (lines 117–118) shows --kv_cache_free_gpu_memory_fraction 0.8
  • docs/source/torch/auto_deploy/advanced/workflow.md (line 22) and serving/benchmarking guides set free_mem_ratio: 0.8
  • docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (lines 55, 72) recommend tuning to 0.8–0.9 but use 0.8 in examples

At the same time, most quick-start recipes (llama3.3, llama4-scout) and tech blogs use 0.9, and scaffolding code defaults to 0.9 (worker.py:151).

To ensure a single source of truth, please choose one of the following:

  • Option A: Update Documentation to 0.9
    – Change all examples that currently use 0.8 to 0.9
    – In tuning guides, note that while the default is 0.9, users may adjust within the 0.8–0.9 range as needed

  • Option B: Change CLI Default to 0.8
    – Modify the default in serve.py and eval.py from 0.9 to 0.8
    – Update scaffolding/worker.py’s default to 0.8 for consistency
    – Clarify in docs that the CLI default is 0.8, with 0.9 still acceptable for higher GPU headroom

Please implement one of these approaches so that code defaults and documentation examples are in sync.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b3955bc and 8b1357d.

📒 Files selected for processing (2)
  • docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/commands/serve.py
🪛 LanguageTool
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md

[grammar] ~15-~15: There might be a mistake here.
Context: ...oy ``` - model: HF name or local path - `--backend _autodeploy`: uses AutoDeploy runtime Once the serv...

(QB_NEW_EN)


[grammar] ~66-~66: There might be a mistake here.
Context: ...e block reuse is disabled automatically for AutoDeploy backend - AutoDeploy backend...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...t yet support disaggregated serving. WIP - For best performance: - Prefer `compil...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ...ted serving. WIP - For best performance: - Prefer compile_backend: torch-opt - ...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/torch/auto_deploy/advanced/serving_with_trtllm_serve.md (1)

74-77: Add cross-link to the Compile Backends section in the Support Matrix

Paths and anchors have been verified in docs/source/torch/auto_deploy/support_matrix.md (heading “Compile Backends” → anchor #compile-backends) and the relative path from this file is correct. Please apply the following update:

 ## See also

 - [AutoDeploy overview](../auto-deploy.md)
 - [Benchmarking with trtllm-bench](./benchmarking_with_trtllm_bench.md)
+ - [Compilation backends support matrix](../support_matrix.md#compile-backends)

@tensorrt-cicd
Collaborator

PR_Github #16129 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12132 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Collaborator

@nv-guomingz nv-guomingz left a comment


LGTM for doc and trtllm-serve part changes.

@suyoggupta suyoggupta merged commit e3de575 into NVIDIA:main Aug 22, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in AutoDeploy Board Aug 22, 2025
@suyoggupta suyoggupta deleted the sg/trtllm-serve-autodeploy branch August 22, 2025 15:44