[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic #6615
base: main
Conversation
📝 Walkthrough
This update introduces and propagates support for specifying token counts for piecewise CUDA graph capture in torch compile workflows. It unifies warmup state management using properties, refactors warmup logic for clarity, and updates related configuration and method names. It also replaces the CUDA-graph-capturing check in the multi-stream scheduling condition with a thread-local flag, adds new context managers for piecewise CUDA graph control, and renames a method in MoeLoadBalancer with corresponding test updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as LlmArgs / TorchCompileConfig
    participant Backend as PyTorchConfig
    participant Engine as PyTorchModelEngine
    API->>Backend: Provide capture_num_tokens (torch_compile_piecewise_cuda_graph_num_tokens)
    Backend->>Engine: Pass piecewise CUDA graph token counts in config
    Engine->>Engine: Use token counts for piecewise CUDA graph warmup and execution
```
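As a rough illustration of the flow above, here is a minimal, self-contained sketch of how the piecewise token-count list can be resolved and filtered on the engine side. The helper function name and plain-list representation are hypothetical; only the config field names (`torch_compile_piecewise_cuda_graph_num_tokens`, `cuda_graph_batch_sizes`) and the fallback/filtering behavior discussed in the review comments below come from this PR.

```python
from typing import List, Optional


def resolve_piecewise_num_tokens(
        capture_num_tokens: Optional[List[int]],
        cuda_graph_batch_sizes: Optional[List[int]],
        max_num_tokens: int) -> List[int]:
    """Prefer the explicit capture list, fall back to CUDA graph batch sizes, then filter."""
    raw_num_tokens = capture_num_tokens or cuda_graph_batch_sizes or []
    return [n for n in raw_num_tokens if n <= max_num_tokens]


# No explicit capture list, so the CUDA graph batch sizes are reused and
# anything above max_num_tokens is dropped.
print(resolve_piecewise_num_tokens(None, [1, 2, 4, 8192], max_num_tokens=4096))
# -> [1, 2, 4]
```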
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Note: ⚡️ Unit Test Generation is now available in beta! Learn more here, or try it out under "Finishing Touches" below.
📜 Recent review details
Configuration used: .coderabbit.yaml
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
🧪 Generate unit tests
Actionable comments posted: 2
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
408-412: Split the long initialization line for better readability.

The line exceeds the 120-character limit and contains complex fallback logic that would be clearer if split.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
+ self._piecewise_cuda_graph_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     else pytorch_backend_config.cuda_graph_batch_sizes
+     if pytorch_backend_config.cuda_graph_batch_sizes
+     else []
+ )
```

678-706: Good simplification of torch compile warmup logic.

The removal of `contextlib.ExitStack` makes the code clearer. Consider splitting the long logging line for consistency.

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```

726-788: Well-structured CUDA graph warmup with new piecewise support.

The reorganization improves clarity, and the reverse sorting enables memory reuse optimization. The new piecewise CUDA graph warmup section correctly implements multiple forward passes with proper memory cleanup.

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/_torch/compilation/backend.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tests/unittest/_torch/modules/test_moe_load_balancer.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tests/unittest/_torch/modules/test_moe_load_balancer.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (5)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/compilation/backend.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
408-408: Line too long (296 > 120)
(E501)
700-700: Line too long (137 > 120)
(E501)
733-733: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tests/unittest/_torch/modules/test_moe_load_balancer.py (3)
270-270: LGTM! Method rename correctly reflected in test.

The test correctly uses the renamed method `set_iter_info` instead of `set_next_iter_info`, maintaining consistency with the implementation changes.
309-309: LGTM! Consistent method rename in test.

The method call correctly reflects the API change from `set_next_iter_info` to `set_iter_info` with appropriate parameters for the statistic kernel test.
371-371: LGTM! Final method rename correctly applied.

The test properly uses the renamed `set_iter_info` method, completing the consistent update across all test cases to align with the implementation changes.

tensorrt_llm/_torch/pyexecutor/config.py (1)
74-74: LGTM! Clean addition of new configuration field.

The new `torch_compile_piecewise_cuda_graph_num_tokens` field is well-named, follows the existing naming convention, and has an appropriate type annotation for specifying token counts for piecewise CUDA graph capture.

tensorrt_llm/_torch/compilation/backend.py (2)
15-15: LGTM! Import addition supports new functionality.

The import of `is_graph_capturing` is necessary for the enhanced CUDA graph capture state checking implemented later in the file.

117-118: LGTM! Enhanced condition adds important safeguard.

The addition of the `is_graph_capturing()`
check ensures multi-stream scheduling only runs when CUDA graph capturing is active, providing better control over the scheduling behavior and aligning with the unified CUDA graph capture management.tensorrt_llm/_torch/pyexecutor/_util.py (1)
246-247
: LGTM! Improved shutdown sequence ensures proper state management.Reordering the warmup flag clearing before shutdown ensures consistent state management and prevents potential issues during resource cleanup. This aligns well with the unified warmup state management approach.
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (2)
795-796
: LGTM! Method rename improves API clarity.The rename from
set_next_iter_info
toset_iter_info
makes the method name more concise and clear while maintaining the same functionality and parameters.
942-943
: LGTM! Method call correctly updated to use renamed method.The method call properly reflects the rename from
set_next_iter_info
toset_iter_info
, maintaining consistency with the method definition change.tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
212-217: LGTM! Clean warmup state management.

The warmup state is properly encapsulated and the flow is clear: set warmup flag, perform warmup operations, then disable warmup. The comment about profiling behavior during warmup adds helpful context.
272-283: Propagate the warmup flag and validate its type

The `is_warmup` setter in py_executor.py should drive the corresponding flag on the underlying engines (to trigger torch.compile optimizations and suppress MoE stats) and guard against non-bool inputs.

– File: tensorrt_llm/_torch/pyexecutor/py_executor.py
Lines ~277–283

Suggested change:

```diff
 @is_warmup.setter
 def is_warmup(self, value: bool):
+    if not isinstance(value, bool):
+        raise TypeError(f"is_warmup must be bool, got {type(value).__name__}")
     self._is_warmup = value
-    # Set warmup flag in model engine to trigger torch compile and avoid moe load balancer statistics update
-    # self.model_engine.is_warmup = value
-    # if self.draft_model_engine is not None:
-    #     self.draft_model_engine.is_warmup = value
+    # Propagate warmup state to the model engines for compile and MoE behavior
+    self.model_engine.is_warmup = value
+    if self.draft_model_engine is not None:
+        self.draft_model_engine.is_warmup = value
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
356-356: LGTM! Clean refactoring of warmup state management.

The property-based approach for `is_warmup` provides better encapsulation and centralizes the logic for toggling torch compile optimizations and MoeLoadBalancer settings.

Also applies to: 465-485
492-497: LGTM! Consistent with property refactoring.

The context manager correctly uses the new `is_warmup` property while maintaining proper state restoration.
707-725: LGTM! Consistent refactoring of autotuner warmup.

The autotuner warmup follows the same simplified pattern as the torch compile warmup, making the code more maintainable.
2136-2137: LGTM! Simplified MoeLoadBalancer retrieval.

The removal of conditional warmup logic here is correct since that's now handled in the `is_warmup` property setter.
```python
torch_compile_piecewise_cuda_graph_num_tokens=self.
torch_compile_config.capture_num_tokens
if self.torch_compile_config is not None else
TorchCompileConfig.model_fields['capture_num_tokens'].default,
```
🛠️ Refactor suggestion

Fallback should respect CUDA-graph batch sizes when unspecified.

The spec says: "If `capture_num_tokens` is `None`, default to the same values as `cuda_graph_config.batch_sizes`". Current code forwards `None` (or the class default) to `PyTorchConfig`, losing that linkage.

Update logic:
- torch_compile_piecewise_cuda_graph_num_tokens=self.
- torch_compile_config.capture_num_tokens
- if self.torch_compile_config is not None else
- TorchCompileConfig.model_fields['capture_num_tokens'].default,
+ torch_compile_piecewise_cuda_graph_num_tokens=(
+ # explicit > inherit from cuda-graph > None
+ (self.torch_compile_config.capture_num_tokens
+ if self.torch_compile_config and
+ self.torch_compile_config.capture_num_tokens is not None
+ else self.cuda_graph_config.batch_sizes
+ if self.cuda_graph_config else None)
+ ),
This guarantees downstream warm-up receives a concrete list when the user doesn’t override it, matching the documented behavior.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- torch_compile_piecewise_cuda_graph_num_tokens=self.
- torch_compile_config.capture_num_tokens
- if self.torch_compile_config is not None else
- TorchCompileConfig.model_fields['capture_num_tokens'].default,
+ torch_compile_piecewise_cuda_graph_num_tokens=(
+     # explicit > inherit from cuda-graph > None
+     (self.torch_compile_config.capture_num_tokens
+      if self.torch_compile_config and
+      self.torch_compile_config.capture_num_tokens is not None
+      else self.cuda_graph_config.batch_sizes
+      if self.cuda_graph_config else None)
+ ),
```
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_args.py around lines 2300 to 2303, the fallback for
capture_num_tokens does not respect the CUDA-graph batch sizes as specified.
Modify the logic so that if capture_num_tokens is None, it defaults to the same
values as cuda_graph_config.batch_sizes instead of forwarding None or the class
default. This ensures downstream warm-up receives a concrete list consistent
with the documented behavior.
/bot run
PR_Github #14078 [ run ] triggered by Bot
PR_Github #14078 [ run ] completed with state
80e16c3 to 52b5a78 (Compare)
/bot run
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (4)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
- tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/_torch/pyexecutor/config.py
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/modules/multi_stream_utils.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
699-699: Line too long (137 > 120)
(E501)
732-732: Line too long (133 > 120)
(E501)
🔇 Additional comments (16)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
6-6: LGTM! Import addition is correct.

The import of `with_multi_stream` is properly sourced from the multi_stream_utils module and is used correctly in the capture method.

8-8: LGTM! Import addition is correct.

The import of `piecewise_cuda_graph` is properly added and used as a context manager in the capture method.

80-84: Excellent refactoring using context managers for state management.

The nested context managers replace manual flag management with a more robust approach that ensures proper cleanup. The warmup and capture logic remains unchanged while improving maintainability.
tensorrt_llm/_torch/modules/multi_stream_utils.py (5)
1-2: LGTM! Required imports for thread-local state management.

The threading and contextmanager imports are necessary and correctly added for the new functionality.

8-14: Thread-local implementation is correct.

The thread-local class properly inherits from `threading.local` and initializes the flag to a sensible default. The global instance pattern is appropriate for module-level state management.

17-23: LGTM! Clean state management functions.

The setter and getter functions provide proper encapsulation of the thread-local state with clear type hints and focused responsibilities.

25-32: Excellent context manager implementation.

The context manager properly saves and restores the previous state using try/finally, ensuring robust cleanup even if exceptions occur.
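For readers who want the pattern in one place, here is a self-contained sketch of a thread-local flag with a getter, a setter, and a save-and-restore context manager, mirroring what the review describes for `multi_stream_utils`. The function names follow the ones mentioned in the review (`do_multi_stream`, `with_multi_stream`), but their exact signatures and the surrounding names are assumptions, not copied from the module.

```python
import threading
from contextlib import contextmanager


class _ThreadLocalFlag(threading.local):
    """Per-thread flag with a sensible default."""

    def __init__(self):
        self.do_multi_stream = False


G_MULTI_STREAM_LOCAL = _ThreadLocalFlag()


def set_do_multi_stream(value: bool) -> None:
    G_MULTI_STREAM_LOCAL.do_multi_stream = value


def do_multi_stream() -> bool:
    return G_MULTI_STREAM_LOCAL.do_multi_stream


@contextmanager
def with_multi_stream(value: bool):
    # Save the previous state so nesting and exceptions both restore it.
    previous = do_multi_stream()
    set_do_multi_stream(value)
    try:
        yield
    finally:
        set_do_multi_stream(previous)


# Usage: the flag is only True inside the scope, then restored.
with with_multi_stream(True):
    assert do_multi_stream()
assert not do_multi_stream()
```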
60-62: Good refactoring of multi-stream decision logic.

The change from `is_graph_capturing()` to `do_multi_stream()` decouples the multi-stream decision from graph capturing state, providing better control. The variable rename to `multi_stream` correctly avoids shadowing the function name.

tensorrt_llm/_torch/pyexecutor/model_engine.py (8)
338-338: LGTM!

The parameter rename from `cuda_graph_batch_sizes` to `capture_num_tokens` improves semantic clarity, better reflecting that it specifies token counts for piecewise CUDA graph capture.

361-361: LGTM!

Clean initialization of the new warmup state property.

464-484: Well-designed property implementation for warmup state.

The property pattern effectively centralizes warmup state management and properly coordinates torch compile optimizations, CUDA graph capture flags, and MoeLoadBalancer settings.

Consider using a more explicit private attribute name in the getter:

```diff
 @property
 def is_warmup(self):
-    return getattr(self, "_is_warmup", False)
+    return self._is_warmup
```

And initialize `self._is_warmup = False` in the constructor instead of using `getattr` with a default.
491-496: LGTM!

Clean update to use the new `is_warmup` property while preserving the same context manager behavior.

677-705: Excellent simplification of torch compile warmup logic.

The refactoring removes complex `ExitStack` management in favor of cleaner context managers. The `no_cuda_graph()` context manager provides better isolation during warmup.

Address the line length violation on line 699:

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```
725-787: Well-structured CUDA graph warmup with piecewise support.

The reorganized CUDA graph warmup logic is clearer with explicit sorting and separation of concerns. The new piecewise CUDA graph warmup section properly handles multiple iterations and memory cleanup.

Address the line length violation on line 732:

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(f"Creating CUDA graph instances for {num_batch_sizes} batch sizes.")
```

The piecewise warmup logic (lines 767-787) looks correct with 3 warmup iterations plus one final iteration, followed by proper GPU memory cleanup.
1525-1525: LGTM!

Correct usage of the new `is_warmup` property for CUDA graph warmup detection.

2135-2136: LGTM!

Clean formatting of the MoeLoadBalancer reference retrieval.

```python
self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
self._piecewise_cuda_graph_num_tokens = [
    i for i in self._piecewise_cuda_graph_num_tokens
    if i <= self.max_num_tokens
]
```
🛠️ Refactor suggestion
Refactor complex initialization logic for better readability.
The nested ternary operators make this initialization difficult to read and maintain. Consider breaking this into multiple lines with clear variable names.
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
- i for i in self._piecewise_cuda_graph_num_tokens
- if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+ pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+ or pytorch_backend_config.cuda_graph_batch_sizes
+ or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+ i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧰 Tools
🪛 Ruff (0.12.2)
322-322: Line too long (296 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 322 to 327, the
initialization of _piecewise_cuda_graph_num_tokens uses nested ternary
operators, making it hard to read. Refactor by assigning intermediate variables
with clear names for each condition, then set _piecewise_cuda_graph_num_tokens
based on these variables in separate lines to improve readability and
maintainability.
/bot run
52b5a78 to 4759be8 (Compare)
/bot run
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
322-327: Refactor complex initialization logic for better readability.

The nested ternary operators make this initialization difficult to read and maintain, and line 322 exceeds the 120-character limit. Consider breaking this into multiple lines with clear variable names.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
677-788: LGTM: Warmup method refactoring improves clarity.

The refactoring successfully:

- Uses the `no_cuda_graph()` context manager for cleaner CUDA graph control
- Adds proper piecewise CUDA graph warmup with the `piecewise_cuda_graph_capture(True)` context manager
- Reorganizes the logic for better readability

However, consider addressing the line length violations (lines 699, 732 exceed 120 characters) for better code style compliance.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/compilation/utils.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline, and attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/compilation/utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/compilation/utils.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T05:46:41.295Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
474-474: Undefined name set_enable_piecewise_cuda_graph_capture_flag
(F821)
477-477: Undefined name set_enable_piecewise_cuda_graph_capture_flag
(F821)
699-699: Line too long (137 > 120)
(E501)
732-732: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/compilation/utils.py (2)
1-1: LGTM!

The `contextlib` import is correctly added to support the new context manager function.

47-54: Excellent implementation of the context manager!

The `piecewise_cuda_graph_capture` context manager correctly follows the standard pattern (see the sketch after this list):
- Saves previous state using the existing getter function
- Sets the new state using the existing setter function
- Uses try/finally for exception safety to ensure state restoration
- Leverages existing helper functions rather than directly accessing globals
This provides a clean, safe way to manage the piecewise CUDA graph capture flag in scoped contexts.
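The same save-and-restore shape, written against a module-level flag with getter/setter helpers, looks roughly like this. The setter name matches the one mentioned elsewhere in the review, while the getter name and flag variable are assumptions standing in for the existing helpers the comment refers to.

```python
import contextlib

G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE = False  # module-level flag (illustrative)


def get_enable_piecewise_cuda_graph_capture_flag() -> bool:
    return G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE


def set_enable_piecewise_cuda_graph_capture_flag(enable: bool) -> None:
    global G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE
    G_ENABLE_PIECEWISE_CUDA_GRAPH_CAPTURE = enable


@contextlib.contextmanager
def piecewise_cuda_graph_capture(enable: bool):
    previous = get_enable_piecewise_cuda_graph_capture_flag()
    set_enable_piecewise_cuda_graph_capture_flag(enable)
    try:
        yield
    finally:
        # Restore the prior state even if the body raises.
        set_enable_piecewise_cuda_graph_capture_flag(previous)
```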
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
41-41: LGTM: Import updated to use new context manager.

The import change from `set_enable_piecewise_cuda_graph_capture_flag` to `piecewise_cuda_graph_capture` aligns with the new scoped flag management pattern using context managers.

338-338: LGTM: Backend configuration updated correctly.

The addition of `capture_num_tokens=self._piecewise_cuda_graph_num_tokens` properly propagates the piecewise CUDA graph configuration to the backend.

361-361: LGTM: Warmup state initialization added.

The initialization of `is_warmup = False` properly sets up the new property-based warmup state management.

491-496: LGTM: Context manager updated to use new property.

The `set_warmup_flag` context manager correctly uses the new `is_warmup` property, maintaining consistency with the property-based warmup state management.

1525-1525: LGTM: Updated to use new warmup property.

The condition correctly uses the `self.is_warmup` property, maintaining consistency with the new property-based warmup state management pattern.

2135-2136: LGTM: Improved MoeLoadBalancer usage pattern.

Using `getattr` with a default `None` value makes the MoeLoadBalancer access more robust and handles cases where it might not be present.
PR_Github #14247 [ run ] triggered by Bot
PR_Github #14247 [ run ] completed with state
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
322-327: Refactor complex initialization logic for better readability.

The nested ternary operators make this initialization difficult to read and maintain. Consider breaking this into multiple lines with clear variable names.

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
675-703: Approve torch compile warmup refactoring with line length fix needed.

The refactored torch compile warmup logic is much cleaner, using the `no_cuda_graph` context manager instead of complex callback mechanisms. However, line 697 exceeds the 120 character limit.

Apply this diff to fix the line length violation:

```diff
- logger.info(
-     f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase"
- )
+ phase_type = 'context' if num_tokens_per_request > 1 else 'generation'
+ logger.info(
+     f"Run warmup for batch size={bs}, pure {phase_type} phase"
+ )
```
723-761: Approve CUDA graph warmup reorganization with line length fix needed.

The CUDA graph warmup logic is well-organized with sorted batch sizes and proper loop structure. However, line 730 exceeds the 120 character limit.

Apply this diff to fix the line length violation:

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ num_batch_sizes = len(self._cuda_graph_batch_sizes)
+ logger.info(
+     f"Creating CUDA graph instances for {num_batch_sizes} batch sizes."
+ )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T08:45:40.690Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
697-697: Line too long (137 > 120)
(E501)
730-730: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
41-41: LGTM! Import change aligns with context manager refactoring.

The import change from `set_enable_piecewise_cuda_graph_capture_flag` to the `piecewise_cuda_graph_capture` context manager is consistent with the broader refactoring to improve scoped flag management.

338-338: LGTM! Proper propagation of piecewise CUDA graph token counts.

The `capture_num_tokens` parameter correctly propagates the filtered token counts to the Backend, enabling piecewise CUDA graph capture configuration.

464-482: LGTM! Clean property-based warmup state management.

The `is_warmup` property implementation properly:

- Uses a safe default value in the getter
- Controls torch compile optimization state in the setter
- Manages MoeLoadBalancer iteration info with the renamed `set_iter_info` method
- Eliminates the need for direct flag manipulation

This addresses the previous undefined function issues and aligns with the new context manager pattern; a condensed sketch of the property shape follows below.
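A condensed, self-contained sketch of that property shape, with stand-in backend and load-balancer objects; the warmup semantics (enable compile optimization and suppress MoE statistics/weight updates during warmup) follow the suggested setter quoted earlier in this conversation, while class and attribute names other than `is_warmup` and `set_iter_info` are illustrative.

```python
class _StubCompileBackend:
    def enable_optimization(self):
        print("compile optimizations on")

    def bypass_optimization(self):
        print("compile optimizations bypassed")


class _StubMoeLoadBalancer:
    def set_iter_info(self, enable_statistic, enable_update_weights):
        print(f"statistics={enable_statistic}, update_weights={enable_update_weights}")


class EngineSketch:
    def __init__(self):
        self._is_warmup = False
        self._torch_compile_backend = _StubCompileBackend()
        self.moe_load_balancer = _StubMoeLoadBalancer()

    @property
    def is_warmup(self) -> bool:
        return self._is_warmup

    @is_warmup.setter
    def is_warmup(self, value: bool) -> None:
        self._is_warmup = value
        # Warmup turns torch.compile optimization on so graphs get compiled,
        # and suppresses MoE load-balancer statistics and weight updates.
        if value:
            self._torch_compile_backend.enable_optimization()
        else:
            self._torch_compile_backend.bypass_optimization()
        self.moe_load_balancer.set_iter_info(enable_statistic=not value,
                                             enable_update_weights=not value)


engine = EngineSketch()
engine.is_warmup = True   # enter warmup
engine.is_warmup = False  # back to normal serving
```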
704-722: LGTM! Simplified autotuner warmup logic.

The autotuner warmup is now cleaner with the `no_cuda_graph` context manager and proper cleanup of the profiling cache. The logic flow is much more readable.

762-786: Excellent addition of piecewise CUDA graph warmup.

The new piecewise CUDA graph warmup section properly:

- Uses the `piecewise_cuda_graph_capture` context manager
- Runs multiple forward passes for proper graph capture
- Includes memory cleanup after each warmup iteration
- Follows the same pattern as other warmup sections

This is a solid implementation that integrates well with the overall warmup refactoring.

2133-2135: LGTM! Simplified MoeLoadBalancer retrieval.

The simplified logic that always retrieves the moe_load_balancer removes conditional complexity, since the load balancer state is now properly managed through the `is_warmup` property setter.
PR_Github #14293 [ run ] triggered by Bot
PR_Github #14293 [ run ] completed with state
/bot run
PR_Github #14356 [ run ] triggered by Bot
PR_Github #14356 [ run ] completed with state
…UDA graph capturing. Update method names for clarity and add new configuration options for piecewise CUDA graph token capture. Signed-off-by: yizhang-nv <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Yi Zhang <[email protected]>
…UDA graph capturing. Update method names for clarity and add new configuration options for piecewise CUDA graph token capture. Signed-off-by: yizhang-nv <[email protected]>
Signed-off-by: yizhang-nv <[email protected]>
61c5535 to c838b7e (Compare)
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
322-327: Refactor complex initialization logic for better readability.

This complex nested ternary operator initialization makes the code difficult to read and maintain, and exceeds the 120-character line length limit.

Consider breaking this into multiple lines with clear variable names:

```diff
- self._piecewise_cuda_graph_num_tokens = pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens if pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens else pytorch_backend_config.cuda_graph_batch_sizes if pytorch_backend_config.cuda_graph_batch_sizes else []
- self._piecewise_cuda_graph_num_tokens = [
-     i for i in self._piecewise_cuda_graph_num_tokens
-     if i <= self.max_num_tokens
- ]
+ # Get num_tokens from piecewise config or fallback to cuda_graph_batch_sizes
+ raw_num_tokens = (
+     pytorch_backend_config.torch_compile_piecewise_cuda_graph_num_tokens
+     or pytorch_backend_config.cuda_graph_batch_sizes
+     or []
+ )
+ self._piecewise_cuda_graph_num_tokens = [
+     i for i in raw_num_tokens if i <= self.max_num_tokens
+ ]
```

464-482: Fix undefined function and align with new context manager pattern.

The property setter uses `set_enable_piecewise_cuda_graph_capture_flag`, which is no longer imported and undefined. Since the import was changed to use the `piecewise_cuda_graph_capture` context manager, this property implementation needs to be updated to align with the new scoped flag management pattern.

Consider removing the direct flag manipulation from the property setter since the new context manager pattern handles this more cleanly in the warmup method itself:

```diff
 @is_warmup.setter
 def is_warmup(self, value: bool):
     self._is_warmup = value
     if self._torch_compile_enabled:
         if value:
             self._torch_compile_backend.enable_optimization()
-            set_enable_piecewise_cuda_graph_capture_flag(True)
         else:
             self._torch_compile_backend.bypass_optimization()
-            set_enable_piecewise_cuda_graph_capture_flag(False)
     moe_load_balancer: MoeLoadBalancer = getattr(self, 'moe_load_balancer', None)
     if moe_load_balancer is not None:
         moe_load_balancer.set_iter_info(enable_statistic=not value,
                                         enable_update_weights=not value)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
675-786: Approve warmup method refactoring with minor style improvements.

The warmup method refactoring successfully simplifies the logic by using context managers (`no_cuda_graph()`, `piecewise_cuda_graph_capture()`) instead of the previous `contextlib.ExitStack` approach. The new piecewise CUDA graph warmup section is well-implemented with proper memory cleanup.

Consider breaking up the long lines to improve readability:

```diff
- min(self.max_num_tokens // max(bs, 1),
-     min(available_tokens, self.max_seq_len - 1))
+ min(
+     self.max_num_tokens // max(bs, 1),
+     min(available_tokens, self.max_seq_len - 1)
+ )
```

```diff
- logger.info(
-     f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes."
- )
+ logger.info(
+     f"Creating CUDA graph instances for "
+     f"{len(self._cuda_graph_batch_sizes)} batch sizes."
+ )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- tensorrt_llm/_torch/compilation/backend.py (4 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (7 hunks)
- tensorrt_llm/_torch/compilation/utils.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (11 hunks)
- tensorrt_llm/_torch/modules/multi_stream_utils.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/_torch/modules/test_moe_load_balancer.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/compilation/utils.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tests/unittest/_torch/modules/test_moe_load_balancer.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
- tensorrt_llm/_torch/compilation/backend.py
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
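As a quick illustration of several of these conventions together (snake_case naming, a Google style docstring, 4-space indentation, and a narrow try-except with the logic in the else block), consider this small made-up helper; it is not taken from the repository:

def parse_max_batch_size(raw_value):
    """Parse a batch-size setting supplied as text.

    Args:
        raw_value: The value read from a config file or environment variable.

    Returns:
        The batch size as a positive int, or None if the value is not a valid integer.
    """
    try:
        parsed = int(raw_value)
    except (TypeError, ValueError):
        # Limit the except clause to the errors int() can actually raise.
        return None
    else:
        # Keep the real logic outside the try body.
        return parsed if parsed > 0 else None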
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
322-322: Line too long (296 > 120)
(E501)
697-697: Line too long (137 > 120)
(E501)
730-730: Line too long (133 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
41-41: LGTM! The import change from `set_enable_piecewise_cuda_graph_capture_flag` to the `piecewise_cuda_graph_capture` context manager aligns well with the refactoring goal of replacing direct flag manipulation with scoped context management.
361-361: LGTM! The initialization of the new `is_warmup` property provides a clean replacement for the previous `in_warmup` boolean attribute while leveraging the property's setter to handle related state management.
489-494: LGTM! The context manager now properly uses the `is_warmup` property, which automatically handles torch compile backend and MOE load balancer state changes through the property setter. This is a clean improvement over the previous boolean attribute approach.
1523-1523: LGTM! The update to use the `self.is_warmup` property is consistent with the warmup state management refactoring.
2133-2134: LGTM! The simplification of MOE load balancer retrieval by always using `getattr` without conditional logic is a clean improvement that reduces complexity in the forward method.
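To make the patterns praised above concrete, here is a minimal sketch. The context-manager name matches the import mentioned in the first comment, but everything else (the thread-local flag, attribute names, and class shape) is an assumption for illustration, not the actual TensorRT-LLM implementation:

import contextlib
import threading

_local = threading.local()  # per-thread capture state (illustrative)

def _capture_enabled() -> bool:
    # Defaults to False when the flag has never been set on this thread.
    return getattr(_local, "capture_enabled", False)

@contextlib.contextmanager
def piecewise_cuda_graph_capture(enable: bool = True):
    # Scope the flag so it is always restored, even if warmup raises.
    previous = _capture_enabled()
    _local.capture_enabled = enable
    try:
        yield
    finally:
        _local.capture_enabled = previous

class EngineWarmupSketch:
    """Illustrative only: fan warmup state out to dependent components."""

    def __init__(self, compile_backend=None, moe_load_balancer=None):
        self._is_warmup = False
        self._compile_backend = compile_backend
        # None when no MoE load balancer is configured; callers branch on the
        # result of getattr(..., None) instead of re-checking config flags.
        self.moe_load_balancer = moe_load_balancer

    @property
    def is_warmup(self) -> bool:
        return self._is_warmup

    @is_warmup.setter
    def is_warmup(self, value: bool) -> None:
        self._is_warmup = value
        # One assignment keeps dependent components in sync (attribute names assumed).
        if self._compile_backend is not None:
            self._compile_backend.in_warmup = value
        balancer = getattr(self, "moe_load_balancer", None)
        if balancer is not None:
            balancer.in_warmup = value

Compared with a bare set-flag call and a raw boolean attribute, the with-scoped flag and the property setter both guarantee that related state is updated and restored in exactly one place.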
PR_Github #14361 [ run ] triggered by Bot |
PR_Github #14361 [ run ] completed with state |
Signed-off-by: Yi Zhang <[email protected]>
/bot run |
PR_Github #14405 [ run ] triggered by Bot |
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Style
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
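As a usage illustration only (the stage name is an example value, not necessarily a stage relevant to this PR), a typical pre-merge trigger combining the documented options could be posted as a PR comment like:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"

Per the option notes above, this disables fail-fast and runs only the named test stage, and the stage-list selection does not update the GitHub check status.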