[None][chore] remove executor config in kv cache creator #7526
Conversation
/bot run
PR_Github #17651 [ run ] triggered by Bot
PR_Github #17651 [ run ] completed with state
/bot run
PR_Github #17772 [ run ] triggered by Bot
PR_Github #17772 [ run ] completed with state
Force-pushed from e5fd5a8 to 2400424 (Compare)
/bot run
PR_Github #17836 [ run ] triggered by Bot
/bot run
PR_Github #17839 [ run ] triggered by Bot
PR_Github #17836 [ run ] completed with state
PR_Github #17839 [ run ] completed with state
Force-pushed from efe4893 to 13add21 (Compare)
/bot run
PR_Github #17858 [ run ] triggered by Bot
PR_Github #17858 [ run ] completed with state
/bot run
PR_Github #17873 [ run ] triggered by Bot
PR_Github #17873 [ run ] completed with state
/bot run
PR_Github #17902 [ run ] triggered by Bot
PR_Github #17902 [ run ] completed with state
/bot run
PR_Github #17912 [ run ] triggered by Bot
PR_Github #17912 [ run ] completed with state
Signed-off-by: leslie-fang25 <[email protected]>
Force-pushed from 13add21 to 9177411 (Compare)
/bot run
PR_Github #18106 [ run ] triggered by Bot
📝 Walkthrough

Refactors KvCacheCreator to accept explicit configuration parameters instead of a single ExecutorConfig, updates all internal references accordingly, and adjusts py_executor_creator to construct KvCacheCreator with granular args and to synchronize max_seq_len between the creator and executor_config during build and estimation flows.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Caller as create_py_executor
participant Creator as KvCacheCreator
participant ExecCfg as executor_config
participant Sampler as Sampler
participant KVMgr as KVCacheManager
Note over Caller,Creator: Construction with explicit params (no ExecutorConfig)
Caller->>Creator: __init__(model_engine, draft_model_engine, mapping, net_max_seq_len, kv_connector_manager, max_num_tokens, max_beam_width, tokens_per_block, max_seq_len, max_batch_size, kv_cache_config, pytorch_backend_config, speculative_config)
activate Creator
Note over Creator,KVMgr: Build managers may adjust max_seq_len internally
Creator->>KVMgr: build_managers(tokens_per_block, max_seq_len, batch/beam limits, configs)
KVMgr-->>Creator: possibly_updated_max_seq_len
Creator->>Creator: self._max_seq_len = possibly_updated_max_seq_len
deactivate Creator
Caller->>ExecCfg: ExecCfg.max_seq_len = Creator._max_seq_len
Caller->>Sampler: update_sampler_max_seq_len(ExecCfg.max_seq_len)
break Estimation flow
Caller->>Creator: set Creator._max_seq_len = trial_max_seq_len
Caller->>KVMgr: rebuild_managers(...)
KVMgr-->>Caller: updated_max_seq_len
Caller->>Creator: Creator._max_seq_len = updated_max_seq_len
Caller->>ExecCfg: ExecCfg.max_seq_len = Creator._max_seq_len
Caller->>Sampler: update_sampler_max_seq_len(ExecCfg.max_seq_len)
end
```
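For illustration, here is a minimal Python sketch of the constructor shape implied by the diagram above. It is not the actual implementation in `tensorrt_llm/_torch/pyexecutor/_util.py`; the parameter list comes from the walkthrough, and the attribute names are assumptions.

```python
# Illustrative sketch only: approximates the refactored KvCacheCreator
# signature described above; attribute names are assumptions.
class KvCacheCreator:

    def __init__(self, *, model_engine, draft_model_engine, mapping,
                 net_max_seq_len, kv_connector_manager, max_num_tokens,
                 max_beam_width, tokens_per_block, max_seq_len,
                 max_batch_size, kv_cache_config, pytorch_backend_config,
                 speculative_config):
        # Keyword-only arguments replace the former monolithic ExecutorConfig,
        # making every dependency explicit at the call site.
        self._model_engine = model_engine
        self._draft_model_engine = draft_model_engine
        self._mapping = mapping
        self._kv_cache_config = kv_cache_config
        # build_managers() may later cap this value; the caller then copies
        # it back into executor_config.max_seq_len.
        self._max_seq_len = max_seq_len
```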
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks (1 passed, 2 warnings)

❌ Failed Checks (2 warnings)
✅ Passed Checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
515-519: Ensure robust max_seq_len synchronization

The synchronization between `kv_cache_creator._max_seq_len` and `executor_config.max_seq_len` appears correct. However, there's a potential risk if `kv_cache_creator._max_seq_len` is uninitialized or modified unexpectedly. Consider adding a validation check.

```diff
 kv_cache_creator.build_managers(resources, estimating_kv_cache)
-# Originally, executor_config.max_seq_len might be changed inside build_managers and used
-# below in create_py_executor_instance. Since now, we are changing
-# kv_cache_creator._max_seq_len instead, restore executor_config.max_seq_len.
-executor_config.max_seq_len = kv_cache_creator._max_seq_len
+# Originally, executor_config.max_seq_len might be changed inside build_managers and used
+# below in create_py_executor_instance. Since now, we are changing
+# kv_cache_creator._max_seq_len instead, restore executor_config.max_seq_len.
+if hasattr(kv_cache_creator, '_max_seq_len'):
+    executor_config.max_seq_len = kv_cache_creator._max_seq_len
+else:
+    logger.warning("kv_cache_creator._max_seq_len not set, keeping original executor_config.max_seq_len")
```
579-587: Add validation for max_seq_len restoration

When restoring `executor_config.max_seq_len` after estimation, the code assumes `kv_cache_creator._max_seq_len` is properly set. Consider adding validation to ensure the value is reasonable.

```diff
 # Before estimating KV cache size, a minimal KV cache has been allocated using
 # create_kv_cache_manager above, which caps kv_cache_creator.max_seq_len. Restoring
 # the original value before creating the final KV cache.
 kv_cache_creator._max_seq_len = max_seq_len
 kv_cache_creator.build_managers(resources, False)
 # Originally, executor_config.max_seq_len might be changed again inside build_managers
 # Since now, we are changing kv_cache_creator.max_seq_len instead.
 # Restore executor_config.max_seq_len which has been used in create_py_executor_instance
-executor_config.max_seq_len = kv_cache_creator._max_seq_len
+if hasattr(kv_cache_creator, '_max_seq_len') and kv_cache_creator._max_seq_len > 0:
+    executor_config.max_seq_len = kv_cache_creator._max_seq_len
+else:
+    logger.error(f"Invalid kv_cache_creator._max_seq_len: {getattr(kv_cache_creator, '_max_seq_len', 'not set')}")
+    executor_config.max_seq_len = max_seq_len
```

tensorrt_llm/_torch/pyexecutor/_util.py (2)
361-361: Consider replacing assert with proper error handling

Using `assert` for validation can be problematic in production, as assertions can be disabled with Python's `-O` flag. Consider using an explicit check with a proper exception.

```diff
-assert model_engine.model.model_config.is_generation, "Only construct KV cache for generation models."
+if not model_engine.model.model_config.is_generation:
+    raise ValueError("Only construct KV cache for generation models.")
```
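To make the concern concrete, here is a small standalone sketch showing that assert-based validation disappears under optimized execution (this demonstrates standard Python behavior and is not code from this PR):

```python
# Save as check_assert.py and compare `python check_assert.py`
# with `python -O check_assert.py`: the -O run raises nothing,
# because assert statements are compiled away.
def require_generation_model(is_generation: bool) -> None:
    assert is_generation, "Only construct KV cache for generation models."

require_generation_model(False)  # AssertionError normally; silent under -O
```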
155-158: Verify vocabulary bounds in dummy request generation

The random token generation could theoretically produce the same token repeatedly. While unlikely to cause issues, consider using `random.sample` for guaranteed uniqueness if needed.

```diff
 input_tokens = [
-    random.randint(0, vocab_size - 1) for _ in range(input_seq_len)
+    random.randint(0, vocab_size - 1) for _ in range(input_seq_len)
 ]
+# Note: Using random.randint is fine here as duplicate tokens are acceptable in input sequences
```
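If uniqueness were ever required, a hypothetical variant using `random.sample` could look like this (the values are invented for illustration; note that sampling without replacement requires `input_seq_len <= vocab_size`, a constraint `random.randint` does not have):

```python
import random

vocab_size = 32000    # example value, not from the PR
input_seq_len = 128   # example value, not from the PR

# Sampling without replacement guarantees unique token IDs;
# raises ValueError if input_seq_len > vocab_size.
input_tokens = random.sample(range(vocab_size), k=input_seq_len)
```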
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/_util.py (11 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/_util.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/_util.py
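A brief, hypothetical snippet illustrating several of the Python guidelines above (all names are invented for the example):

```python
from os import path  # keep module namespace: call path.join(), not a bare join()

G_CALL_COUNT = 0   # global: upper SNAKE_CASE with a 'G' prefix
MAX_RETRIES = 3    # constant: upper SNAKE_CASE

class LatencyTracker:  # class name: PascalCase
    """Tracks request latencies (Google-style docstring)."""

    def __init__(self):
        # Initialize all externally visible members in the constructor.
        self.k_99th_percentile = 0.0  # 'k' prefix: name would start with a number

    def record_latency(self, latency_ms: float) -> None:  # method: snake_case
        self.k_99th_percentile = max(self.k_99th_percentile, latency_ms)

print(path.join("logs", "latency.txt"))
```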
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (3)
tokens_per_block (581-585)
max_seq_len (561-571)
max_seq_len (574-578)

tensorrt_llm/_torch/pyexecutor/_util.py (2)
try_prepare_estimation (208-219)
build_managers (485-502)
tensorrt_llm/_torch/pyexecutor/_util.py (5)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
PyTorchModelEngine (263-2464)

tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig (961-1095)

tensorrt_llm/_torch/pyexecutor/config.py (1)
PyTorchConfig (16-114)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
KVCacheManager (142-1011)

tensorrt_llm/_torch/model_config.py (1)
get_bindings_model_config (431-528)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/_util.py
361-361: Use of `assert` detected (S101)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
495-509: Good refactor: explicit parameters improve clarity

The change from passing `executor_config` to passing explicit configuration parameters to `KvCacheCreator` improves code clarity and reduces coupling between the executor configuration and KV cache creation logic.

tensorrt_llm/_torch/pyexecutor/_util.py (3)
46-62: LGTM! Clean refactor to explicit parameters

The refactoring from `ExecutorConfig` to explicit parameters in the `KvCacheCreator` constructor improves the API clarity and reduces coupling. The parameter names are descriptive, and the keyword-only arguments enforce clear usage.
479-482: Good separation of concerns with max_seq_len update

The logic for updating `self._max_seq_len` when the KV cache manager modifies it is well-placed. This ensures that any changes made by the KvCacheManager are captured and can be propagated back to the executor config.
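As a rough sketch of the pattern being praised here (assumed structure and names, not the actual `_util.py` code):

```python
# Hypothetical shape of the flow: the KV cache manager may cap the
# sequence length, and the creator records the updated value so the
# caller can copy it back into executor_config.max_seq_len.
def build_managers(self, resources, estimating_kv_cache):
    kv_cache_manager = self._create_kv_cache_manager(resources, estimating_kv_cache)
    if kv_cache_manager.max_seq_len < self._max_seq_len:
        self._max_seq_len = kv_cache_manager.max_seq_len
```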
217-219: Clean update to use new internal field

The change to update `self._kv_cache_config.max_tokens` instead of the executor config field is consistent with the refactoring approach.
PR_Github #18106 [ run ] completed with state
/bot run
PR_Github #18151 [ run ] triggered by Bot
LGTM
PR_Github #18151 [ run ] completed with state
/bot run
PR_Github #18275 [ run ] triggered by Bot
/bot run
PR_Github #18311 [ run ] triggered by Bot
PR_Github #18311 [ run ] completed with state
/bot run
PR_Github #18347 [ run ] triggered by Bot
PR_Github #18347 [ run ] completed with state
Signed-off-by: leslie-fang25 <[email protected]>
Signed-off-by: leslie-fang25 <[email protected]>
Summary by CodeRabbit
Refactor
Chores
Description
Remove executor config in kv cache creator
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id
(OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge
(OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
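For example, a typical invocation combining two of the options above (the stage name is taken from the examples in this help text):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```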
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.