[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic #6615
base: main
Conversation
📝 Walkthrough
This update introduces and propagates support for specifying token counts for piecewise CUDA graph capture in torch compile workflows. It unifies warmup state management using properties, refactors warmup logic for clarity, and updates related configuration and method names. It also replaces the CUDA-graph-capturing check in the multi-stream scheduling condition with a thread-local flag, adds new context managers for piecewise CUDA graph control, and renames a method in MoeLoadBalancer with corresponding test updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as LlmArgs / TorchCompileConfig
    participant Backend as PyTorchConfig
    participant Engine as PyTorchModelEngine
    API->>Backend: Provide capture_num_tokens (torch_compile_piecewise_cuda_graph_num_tokens)
    Backend->>Engine: Pass piecewise CUDA graph token counts in config
    Engine->>Engine: Use token counts for piecewise CUDA graph warmup and execution
```
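As a rough illustration of the flow above, here is a minimal, self-contained sketch of how the piecewise token-count list can be resolved and filtered on the engine side. The helper function name and plain-list representation are hypothetical; only the config field names (`torch_compile_piecewise_cuda_graph_num_tokens`, `cuda_graph_batch_sizes`) and the fallback/filtering behavior discussed in the review comments below come from this PR.

```python
from typing import List, Optional


def resolve_piecewise_num_tokens(
        capture_num_tokens: Optional[List[int]],
        cuda_graph_batch_sizes: Optional[List[int]],
        max_num_tokens: int) -> List[int]:
    """Prefer the explicit capture list, fall back to CUDA graph batch sizes, then filter."""
    raw_num_tokens = capture_num_tokens or cuda_graph_batch_sizes or []
    return [n for n in raw_num_tokens if n <= max_num_tokens]


# No explicit capture list, so the CUDA graph batch sizes are reused and
# anything above max_num_tokens is dropped.
print(resolve_piecewise_num_tokens(None, [1, 2, 4, 8192], max_num_tokens=4096))
# -> [1, 2, 4]
```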
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Note: ⚡️ Unit Test Generation is now available in beta! Learn more here, or try it out under "Finishing Touches" below.
📜 Recent review details
Configuration used: .coderabbit.yaml
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
🧪 Generate unit tests
Actionable comments posted: 2
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
408-412: Split the long initialization line for better readability.

The line exceeds the 120-character limit and contains complex fallback logic that would be clearer if split.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
+ self._piecewise_cuda_graph_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     else pytorch_backend_config.cuda_graph_batch_sizes
+     if pytorch_backend_config.cuda_graph_batch_sizes
+     else []
+ )
```

678-706: Good simplification of torch compile warmup logic.

The removal of `contextlib.ExitStack` makes the code clearer. Consider splitting the long logging line for consistency.

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```

726-788: Well-structured CUDA graph warmup with new piecewise support.

The reorganization improves clarity, and the reverse sorting enables memory reuse optimization. The new piecewise CUDA graph warmup section correctly implements multiple forward passes with proper memory cleanup.

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/_torch/compilation/backend.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tests/unittest/_torch/modules/test_moe_load_balancer.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tests/unittest/_torch/modules/test_moe_load_balancer.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (5)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
408-408: Line too long (296 > 120)
(E501)
700-700: Line too long (137 > 120)
(E501)
733-733: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tests/unittest/_torch/modules/test_moe_load_balancer.py (3)
270-270: LGTM! Method rename correctly reflected in test.

The test correctly uses the renamed method `set_iter_info` instead of `set_next_iter_info`, maintaining consistency with the implementation changes.
309-309: LGTM! Consistent method rename in test.

The method call correctly reflects the API change from `set_next_iter_info` to `set_iter_info` with appropriate parameters for the statistic kernel test.
371-371: LGTM! Final method rename correctly applied.

The test properly uses the renamed `set_iter_info` method, completing the consistent update across all test cases to align with the implementation changes.

tensorrt_llm/_torch/pyexecutor/config.py (1)
74-74: LGTM! Clean addition of new configuration field.

The new `torch_compile_piecewise_cuda_graph_num_tokens` field is well-named, follows the existing naming convention, and has an appropriate type annotation for specifying token counts for piecewise CUDA graph capture.

tensorrt_llm/_torch/compilation/backend.py (2)
15-15: LGTM! Import addition supports new functionality.

The import of `is_graph_capturing` is necessary for the enhanced CUDA graph capture state checking implemented later in the file.

117-118: LGTM! Enhanced condition adds important safeguard.

The addition of the `is_graph_capturing()`
check ensures multi-stream scheduling only runs when CUDA graph capturing is active, providing better control over the scheduling behavior and aligning with the unified CUDA graph capture management.tensorrt_llm/_torch/pyexecutor/_util.py (1)
246-247
: LGTM! Improved shutdown sequence ensures proper state management.Reordering the warmup flag clearing before shutdown ensures consistent state management and prevents potential issues during resource cleanup. This aligns well with the unified warmup state management approach.
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2)
795-796
: LGTM! Method rename improves API clarity.The rename from
set_next_iter_info
toset_iter_info
makes the method name more concise and clear while maintaining the same functionality and parameters.
942-943
: LGTM! Method call correctly updated to use renamed method.The method call properly reflects the rename from
set_next_iter_info
toset_iter_info
, maintaining consistency with the method definition change.tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
212-217: LGTM! Clean warmup state management.

The warmup state is properly encapsulated and the flow is clear: set warmup flag, perform warmup operations, then disable warmup. The comment about profiling behavior during warmup adds helpful context.
272-283: Propagate the warmup flag and validate its type

The `is_warmup` setter in py_executor.py should drive the corresponding flag on the underlying engines (to trigger torch.compile optimizations and suppress MoE stats) and guard against non-bool inputs.

– File: tensorrt_llm/_torch/pyexecutor/py_executor.py
Lines ~277–283

Suggested change:

```diff
 @is_warmup.setter
 def is_warmup(self, value: bool):
+    if not isinstance(value, bool):
+        raise TypeError(f"is_warmup must be bool, got {type(value).__name__}")
     self._is_warmup = value
-    # Set warmup flag in model engine to trigger torch compile and avoid moe load balancer statistics update
-    # self.model_engine.is_warmup = value
-    # if self.draft_model_engine is not None:
-    #     self.draft_model_engine.is_warmup = value
+    # Propagate warmup state to the model engines for compile and MoE behavior
+    self.model_engine.is_warmup = value
+    if self.draft_model_engine is not None:
+        self.draft_model_engine.is_warmup = value
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
356-356: LGTM! Clean refactoring of warmup state management.

The property-based approach for `is_warmup` provides better encapsulation and centralizes the logic for toggling torch compile optimizations and MoeLoadBalancer settings.

Also applies to: 465-485
492-497: LGTM! Consistent with property refactoring.

The context manager correctly uses the new `is_warmup` property while maintaining proper state restoration.
707-725: LGTM! Consistent refactoring of autotuner warmup.

The autotuner warmup follows the same simplified pattern as the torch compile warmup, making the code more maintainable.
2136-2137: LGTM! Simplified MoeLoadBalancer retrieval.

The removal of conditional warmup logic here is correct since that's now handled in the `is_warmup` property setter.
```python
torch_compile_piecewise_cuda_graph_num_tokens=self.
torch_compile_config.capture_num_tokens
if self.torch_compile_config is not None else
TorchCompileConfig.model_fields['capture_num_tokens'].default,
```
🛠️ Refactor suggestion

Fallback should respect CUDA-graph batch sizes when unspecified.

The spec says: "If `capture_num_tokens` is `None`, default to the same values as `cuda_graph_config.batch_sizes`". Current code forwards `None` (or the class default) to `PyTorchConfig`, losing that linkage.

Update logic:
- torch_compile_piecewise_cuda_graph_num_tokens=self.
- torch_compile_config.capture_num_tokens
- if self.torch_compile_config is not None else
- TorchCompileConfig.model_fields['capture_num_tokens'].default,
+ torch_compile_piecewise_cuda_graph_num_tokens=(
+ # explicit > inherit from cuda-graph > None
+ (self.torch_compile_config.capture_num_tokens
+ if self.torch_compile_config and
+ self.torch_compile_config.capture_num_tokens is not None
+ else self.cuda_graph_config.batch_sizes
+ if self.cuda_graph_config else None)
+ ),
This guarantees downstream warm-up receives a concrete list when the user doesn’t override it, matching the documented behavior.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- torch_compile_piecewise_cuda_graph_num_tokens=self.
- torch_compile_config.capture_num_tokens
- if self.torch_compile_config is not None else
- TorchCompileConfig.model_fields['capture_num_tokens'].default,
+ torch_compile_piecewise_cuda_graph_num_tokens=(
+     # explicit > inherit from cuda-graph > None
+     (self.torch_compile_config.capture_num_tokens
+      if self.torch_compile_config and
+      self.torch_compile_config.capture_num_tokens is not None
+      else self.cuda_graph_config.batch_sizes
+      if self.cuda_graph_config else None)
+ ),
```
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_args.py around lines 2300 to 2303, the fallback for
capture_num_tokens does not respect the CUDA-graph batch sizes as specified.
Modify the logic so that if capture_num_tokens is None, it defaults to the same
values as cuda_graph_config.batch_sizes instead of forwarding None or the class
default. This ensures downstream warm-up receives a concrete list consistent
with the documented behavior.
/bot run
PR_Github #14078 [ run ] triggered by Bot
PR_Github #14078 [ run ] completed with state
80e16c3 to 52b5a78 (Compare)
/bot run
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (4)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
- tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/_torch/pyexecutor/config.py
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
699-699: Line too long (137 > 120)
(E501)
732-732: Line too long (133 > 120)
(E501)
🔇 Additional comments (16)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
6-6: LGTM! Import addition is correct.

The import of `with_multi_stream` is properly sourced from the multi_stream_utils module and is used correctly in the capture method.

8-8: LGTM! Import addition is correct.

The import of `piecewise_cuda_graph` is properly added and used as a context manager in the capture method.

80-84: Excellent refactoring using context managers for state management.

The nested context managers replace manual flag management with a more robust approach that ensures proper cleanup. The warmup and capture logic remains unchanged while improving maintainability.
tensorrt_llm/_torch/modules/multi_stream_utils.py (5)
1-2: LGTM! Required imports for thread-local state management.

The threading and contextmanager imports are necessary and correctly added for the new functionality.

8-14: Thread-local implementation is correct.

The thread-local class properly inherits from `threading.local` and initializes the flag to a sensible default. The global instance pattern is appropriate for module-level state management.

17-23: LGTM! Clean state management functions.

The setter and getter functions provide proper encapsulation of the thread-local state with clear type hints and focused responsibilities.

25-32: Excellent context manager implementation.

The context manager properly saves and restores the previous state using try/finally, ensuring robust cleanup even if exceptions occur.
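For readers who want the pattern in one place, here is a self-contained sketch of a thread-local flag with a getter, a setter, and a save-and-restore context manager, mirroring what the review describes for `multi_stream_utils`. The function names follow the ones mentioned in the review (`do_multi_stream`, `with_multi_stream`), but their exact signatures and the surrounding names are assumptions, not copied from the module.

```python
import threading
from contextlib import contextmanager


class _ThreadLocalFlag(threading.local):
    """Per-thread flag with a sensible default."""

    def __init__(self):
        self.do_multi_stream = False


G_MULTI_STREAM_LOCAL = _ThreadLocalFlag()


def set_do_multi_stream(value: bool) -> None:
    G_MULTI_STREAM_LOCAL.do_multi_stream = value


def do_multi_stream() -> bool:
    return G_MULTI_STREAM_LOCAL.do_multi_stream


@contextmanager
def with_multi_stream(value: bool):
    # Save the previous state so nesting and exceptions both restore it.
    previous = do_multi_stream()
    set_do_multi_stream(value)
    try:
        yield
    finally:
        set_do_multi_stream(previous)


# Usage: the flag is only True inside the scope, then restored.
with with_multi_stream(True):
    assert do_multi_stream()
assert not do_multi_stream()
```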
60-62: Good refactoring of multi-stream decision logic.

The change from `is_graph_capturing()` to `do_multi_stream()` decouples the multi-stream decision from graph capturing state, providing better control. The variable rename to `multi_stream` correctly avoids shadowing the function name.

tensorrt_llm/_torch/pyexecutor/model_engine.py (8)
338-338: LGTM!

The parameter rename from `cuda_graph_batch_sizes` to `capture_num_tokens` improves semantic clarity, better reflecting that it specifies token counts for piecewise CUDA graph capture.

361-361: LGTM!

Clean initialization of the new warmup state property.

464-484: Well-designed property implementation for warmup state.

The property pattern effectively centralizes warmup state management and properly coordinates torch compile optimizations, CUDA graph capture flags, and MoeLoadBalancer settings.

Consider using a more explicit private attribute name in the getter:

```diff
 @property
 def is_warmup(self):
-    return getattr(self, "_is_warmup", False)
+    return self._is_warmup
```

And initialize `self._is_warmup = False` in the constructor instead of using `getattr` with a default.
491-496: LGTM!

Clean update to use the new `is_warmup` property while preserving the same context manager behavior.

677-705: Excellent simplification of torch compile warmup logic.

The refactoring removes complex `ExitStack` management in favor of cleaner context managers. The `no_cuda_graph()` context manager provides better isolation during warmup.

Address the line length violation on line 699:

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```
725-787: Well-structured CUDA graph warmup with piecewise support.

The reorganized CUDA graph warmup logic is clearer with explicit sorting and separation of concerns. The new piecewise CUDA graph warmup section properly handles multiple iterations and memory cleanup.

Address the line length violation on line 732:

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")
```

The piecewise warmup logic (lines 767-787) looks correct with 3 warmup iterations plus one final iteration, followed by proper GPU memory cleanup.
1525-1525: LGTM!

Correct usage of the new `is_warmup` property for CUDA graph warmup detection.

2135-2136: LGTM!

Clean formatting of the MoeLoadBalancer reference retrieval.

```python
self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
self._piecewise_cuda_graph_num_tokens = [
    i for i in self._piecewise_cuda_graph_num_tokens
    if i <= self.max_num_tokens
]
```
🛠️ Refactor suggestion
Refactor complex initialization logic for better readability.
The nested ternary operators make this initialization difficult to read and maintain. Consider breaking this into multiple lines with clear variable names.
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
- i for i in self._piecewise_cuda_graph_num_tokens
- if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+ pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+ or pytorch_backend_config.cuda_graph_batch_sizes
+ or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+ i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧰 Tools
🪛 Ruff (0.12.2)
322-322: Line too long (296 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 322 to 327, the
initialization of _piecewise_cuda_graph_num_tokens uses nested ternary
operators, making it hard to read. Refactor by assigning intermediate variables
with clear names for each condition, then set _piecewise_cuda_graph_num_tokens
based on these variables in separate lines to improve readability and
maintainability.
/bot run
52b5a78 to 4759be8 (Compare)
/bot run
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
322-327: Refactor complex initialization logic for better readability.

The nested ternary operators make this initialization difficult to read and maintain, and line 322 exceeds the 120-character limit. Consider breaking this into multiple lines with clear variable names.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
677-788: LGTM: Warmup method refactoring improves clarity.

The refactoring successfully:

- Uses the `no_cuda_graph()` context manager for cleaner CUDA graph control
- Adds proper piecewise CUDA graph warmup with the `piecewise_cuda_graph_capture(True)` context manager
- Reorganizes the logic for better readability

However, consider addressing the line length violations (lines 699, 732 exceed 120 characters) for better code style compliance.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/compilation/utils.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline, and attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/compilation/utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/compilation/utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T05:46:41.295Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
474-474: Undefined name set_enable_piecewise_cuda_graph_capture_flag
(F821)
477-477: Undefined name set_enable_piecewise_cuda_graph_capture_flag
(F821)
699-699: Line too long (137 > 120)
(E501)
732-732: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/compilation/utils.py (2)
1-1: LGTM!

The `contextlib` import is correctly added to support the new context manager function.

47-54: Excellent implementation of the context manager!

The `piecewise_cuda_graph_capture` context manager correctly follows the standard pattern (see the sketch after this list):
- Saves previous state using the existing getter function
- Sets the new state using the existing setter function
- Uses try/finally for exception safety to ensure state restoration
- Leverages existing helper functions rather than directly accessing globals
This provides a clean, safe way to manage the piecewise CUDA graph capture flag in scoped contexts.
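The same save-and-restore shape, written against a module-level flag with getter/setter helpers, looks roughly like this. The setter name matches the one mentioned elsewhere in the review, while the getter name and flag variable are assumptions standing in for the existing helpers the comment refers to.

```python
import contextlib

G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE = False  # module-level flag (illustrative)


def get_enable_piecewise_cuda_graph_capture_flag() -> bool:
    return G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE


def set_enable_piecewise_cuda_graph_capture_flag(enable: bool) -> None:
    global G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE
    G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE = enable


@contextlib.contextmanager
def piecewise_cuda_graph_capture(enable: bool):
    previous = get_enable_piecewise_cuda_graph_capture_flag()
    set_enable_piecewise_cuda_graph_capture_flag(enable)
    try:
        yield
    finally:
        # Restore the prior state even if the body raises.
        set_enable_piecewise_cuda_graph_capture_flag(previous)
```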
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
41-41: LGTM: Import updated to use new context manager.

The import change from `set_enable_piecewise_cuda_graph_capture_flag` to `piecewise_cuda_graph_capture` aligns with the new scoped flag management pattern using context managers.

338-338: LGTM: Backend configuration updated correctly.

The addition of `capture_num_tokens=self._piecewise_cuda_graph_num_tokens` properly propagates the piecewise CUDA graph configuration to the backend.

361-361: LGTM: Warmup state initialization added.

The initialization of `is_warmup = False` properly sets up the new property-based warmup state management.

491-496: LGTM: Context manager updated to use new property.

The `set_warmup_flag` context manager correctly uses the new `is_warmup` property, maintaining consistency with the property-based warmup state management.

1525-1525: LGTM: Updated to use new warmup property.

The condition correctly uses the `self.is_warmup` property, maintaining consistency with the new property-based warmup state management pattern.

2135-2136: LGTM: Improved MoeLoadBalancer usage pattern.

Using `getattr` with a default `None` value makes the MoeLoadBalancer access more robust and handles cases where it might not be present.
PR_Github #14247 [ run ] triggered by Bot
PR_Github #14247 [ run ] completed with state
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
322-327: Refactor complex initialization logic for better readability.

The nested ternary operators make this initialization difficult to read and maintain. Consider breaking this into multiple lines with clear variable names.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
675-703: Approve torch compile warmup refactoring with line length fix needed.

The refactored torch compile warmup logic is much cleaner, using the `no_cuda_graph` context manager instead of complex callback mechanisms. However, line 697 exceeds the 120 character limit.

Apply this diff to fix the line length violation:

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```
723-761: Approve CUDA graph warmup reorganization with line length fix needed.

The CUDA graph warmup logic is well-organized with sorted batch sizes and proper loop structure. However, line 730 exceeds the 120 character limit.

Apply this diff to fix the line length violation:

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(
+     f"Creating CUDA graph instances for {num_batch_sizes} batch sizes."
+ )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T08:45:40.690Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
697-697: Line too long (137 > 120)
(E501)
730-730: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
41-41: LGTM! Import change aligns with context manager refactoring.

The import change from `set_enable_piecewise_cuda_graph_capture_flag` to the `piecewise_cuda_graph_capture` context manager is consistent with the broader refactoring to improve scoped flag management.

338-338: LGTM! Proper propagation of piecewise CUDA graph token counts.

The `capture_num_tokens` parameter correctly propagates the filtered token counts to the Backend, enabling piecewise CUDA graph capture configuration.

464-482: LGTM! Clean property-based warmup state management.

The `is_warmup` property implementation properly:

- Uses a safe default value in the getter
- Controls torch compile optimization state in the setter
- Manages MoeLoadBalancer iteration info with the renamed `set_iter_info` method
- Eliminates the need for direct flag manipulation

This addresses the previous undefined function issues and aligns with the new context manager pattern; a condensed sketch of the property shape follows below.
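A condensed, self-contained sketch of that property shape, with stand-in backend and load-balancer objects; the warmup semantics (enable compile optimization and suppress MoE statistics/weight updates during warmup) follow the suggested setter quoted earlier in this conversation, while class and attribute names other than `is_warmup` and `set_iter_info` are illustrative.

```python
class _StubCompileBackend:
    def enable_optimization(self):
        print("compile optimizations on")

    def bypass_optimization(self):
        print("compile optimizations bypassed")


class _StubMoeLoadBalancer:
    def set_iter_info(self, enable_statistic, enable_update_weights):
        print(f"statistics={enable_statistic}, update_weights={enable_update_weights}")


class EngineSketch:
    def __init__(self):
        self._is_warmup = False
        self._torch_compile_backend = _StubCompileBackend()
        self.moe_load_balancer = _StubMoeLoadBalancer()

    @property
    def is_warmup(self) -> bool:
        return self._is_warmup

    @is_warmup.setter
    def is_warmup(self, value: bool) -> None:
        self._is_warmup = value
        # Warmup turns torch.compile optimization on so graphs get compiled,
        # and suppresses MoE load-balancer statistics and weight updates.
        if value:
            self._torch_compile_backend.enable_optimization()
        else:
            self._torch_compile_backend.bypass_optimization()
        self.moe_load_balancer.set_iter_info(enable_statistic=not value,
                                             enable_update_weights=not value)


engine = EngineSketch()
engine.is_warmup = True   # enter warmup
engine.is_warmup = False  # back to normal serving
```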
704-722: LGTM! Simplified autotuner warmup logic.

The autotuner warmup is now cleaner with the `no_cuda_graph` context manager and proper cleanup of the profiling cache. The logic flow is much more readable.

762-786: Excellent addition of piecewise CUDA graph warmup.

The new piecewise CUDA graph warmup section properly:

- Uses the `piecewise_cuda_graph_capture` context manager
- Runs multiple forward passes for proper graph capture
- Includes memory cleanup after each warmup iteration
- Follows the same pattern as other warmup sections

This is a solid implementation that integrates well with the overall warmup refactoring.

2133-2135: LGTM! Simplified MoeLoadBalancer retrieval.

The simplified logic that always retrieves the moe_load_balancer removes conditional complexity, since the load balancer state is now properly managed through the `is_warmup` property setter.
PR_Github #14293 [ run ] triggered by Bot
PR_Github #14293 [ run ] completed with state
/bot run
PR_Github #14356 [ run ] triggered by Bot
PR_Github #14356 [ run ] completed with state
…UDA graph capturing. Update method names for clarity and add new configuration options for piecewise CUDA graph token capture. Signed-off-by: yizhang-nv <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Yi Zhang <[email protected]>
…UDA graph capturing. Update method names for clarity and add new configuration options for piecewise CUDA graph token capture. Signed-off-by: yizhang-nv <[email protected]>
Signed-off-by: yizhang-nv <[email protected]>
61c5535 to c838b7e (Compare)
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
322-327: Refactor complex initialization logic for better readability.

This complex nested ternary operator initialization makes the code difficult to read and maintain, and exceeds the 120-character line length limit.

Consider breaking this into multiple lines with clear variable names:

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```

464-482: Fix undefined function and align with new context manager pattern.

The property setter uses `set_enable_piecewise_cuda_graph_capture_flag`, which is no longer imported and undefined. Since the import was changed to use the `piecewise_cuda_graph_capture` context manager, this property implementation needs to be updated to align with the new scoped flag management pattern.

Consider removing the direct flag manipulation from the property setter since the new context manager pattern handles this more cleanly in the warmup method itself:

```diff
 @is_warmup.setter
 def is_warmup(self, value: bool):
     self._is_warmup = value
     if self._torch_compile_enabled:
         if value:
             self._torch_compile_backend.enable_optimization()
-            set_enable_piecewise_cuda_graph_capture_flag(True)
         else:
             self._torch_compile_backend.bypass_optimization()
-            set_enable_piecewise_cuda_graph_capture_flag(False)
     moe_load_balancer: MoeLoadBalancer = getattr(self, 'moe_load_balancer', None)
     if moe_load_balancer is not None:
         moe_load_balancer.set_iter_info(enable_statistic=not value,
                                         enable_update_weights=not value)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
675-786: Approve warmup method refactoring with minor style improvements.

The warmup method refactoring successfully simplifies the logic by using context managers (`no_cuda_graph()`, `piecewise_cuda_graph_capture()`) instead of the previous `contextlib.ExitStack` approach. The new piecewise CUDA graph warmup section is well-implemented with proper memory cleanup.

Consider breaking up the long lines to improve readability:

```diff
- min(self.max_num_tokens // max(bs, 1),
-     min(available_tokens, self.max_seq_len - 1))
+ min(
+     self.max_num_tokens // max(bs, 1),
+     min(available_tokens, self.max_seq_len - 1)
+ )
```

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ logger.info(
+     f"Creating CUDA graph instances for "
+     f"{len(self._cuda_graph_batch_sizes)} batch sizes."
+ )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/compilation/utils.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/compilation/utils.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
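As a quick illustration of several of these conventions together (snake_case naming, a Google style docstring, 4-space indentation, and a narrow try-except with the logic in the else block), consider this small made-up helper; it is not taken from the repository:

def parse_max_batch_size(raw_value):
    """Parse a batch-size setting supplied as text.

    Args:
        raw_value: The value read from a config file or environment variable.

    Returns:
        The batch size as a positive int, or None if the value is not a valid integer.
    """
    try:
        parsed = int(raw_value)
    except (TypeError, ValueError):
        # Limit the except clause to the errors int() can actually raise.
        return None
    else:
        # Keep the real logic outside the try body.
        return parsed if parsed > 0 else None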
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
697-697: Line too long (137 > 120)
(E501)
730-730: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
41-41: LGTM! The import change from `set_enable_piecewise_cuda_graph_capture_flag` to the `piecewise_cuda_graph_capture` context manager aligns well with the refactoring goal of replacing direct flag manipulation with scoped context management.
361-361: LGTM! The initialization of the new `is_warmup` property provides a clean replacement for the previous `in_warmup` boolean attribute while leveraging the property's setter to handle related state management.
489-494: LGTM! The context manager now properly uses the `is_warmup` property, which automatically handles torch compile backend and MOE load balancer state changes through the property setter. This is a clean improvement over the previous boolean attribute approach.
1523-1523: LGTM! The update to use the `self.is_warmup` property is consistent with the warmup state management refactoring.
2133-2134: LGTM! The simplification of MOE load balancer retrieval by always using `getattr` without conditional logic is a clean improvement that reduces complexity in the forward method.
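To make the patterns praised above concrete, here is a minimal sketch. The context-manager name matches the import mentioned in the first comment, but everything else (the thread-local flag, attribute names, and class shape) is an assumption for illustration, not the actual TensorRT-LLM implementation:

import contextlib
import threading

_local = threading.local()  # per-thread capture state (illustrative)

def _capture_enabled() -> bool:
    # Defaults to False when the flag has never been set on this thread.
    return getattr(_local, "capture_enabled", False)

@contextlib.contextmanager
def piecewise_cuda_graph_capture(enable: bool = True):
    # Scope the flag so it is always restored, even if warmup raises.
    previous = _capture_enabled()
    _local.capture_enabled = enable
    try:
        yield
    finally:
        _local.capture_enabled = previous

class EngineWarmupSketch:
    """Illustrative only: fan warmup state out to dependent components."""

    def __init__(self, compile_backend=None, moe_load_balancer=None):
        self._is_warmup = False
        self._compile_backend = compile_backend
        # None when no MoE load balancer is configured; callers branch on the
        # result of getattr(..., None) instead of re-checking config flags.
        self.moe_load_balancer = moe_load_balancer

    @property
    def is_warmup(self) -> bool:
        return self._is_warmup

    @is_warmup.setter
    def is_warmup(self, value: bool) -> None:
        self._is_warmup = value
        # One assignment keeps dependent components in sync (attribute names assumed).
        if self._compile_backend is not None:
            self._compile_backend.in_warmup = value
        balancer = getattr(self, "moe_load_balancer", None)
        if balancer is not None:
            balancer.in_warmup = value

Compared with a bare set-flag call and a raw boolean attribute, the with-scoped flag and the property setter both guarantee that related state is updated and restored in exactly one place.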
PR_Github #14361 [ run ] triggered by Bot |
PR_Github #14361 [ run ] completed with state |
Signed-off-by: Yi Zhang <[email protected]>
/bot run |
PR_Github #14405 [ run ] triggered by Bot |
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Style
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
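As a usage illustration only (the stage name is an example value, not necessarily a stage relevant to this PR), a typical pre-merge trigger combining the documented options could be posted as a PR comment like:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"

Per the option notes above, this disables fail-fast and runs only the named test stage, and the stage-list selection does not update the GitHub check status.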