
[TRTLLM-5930][doc] 1.0 Documentation. #6696


Open · wants to merge 1 commit into base `release/1.0`

Conversation

@nv-guomingz (Collaborator) commented Aug 7, 2025

1.0 Documentation.

Summary by CodeRabbit

  • Documentation
    • Added extensive new documentation covering performance analysis, benchmarking, multi-head attention, disaggregated serving, feature compatibility, KV cache system, long sequence optimizations, LoRA fine-tuning, multi-modality, overlap scheduler, paged attention and scheduling, parallel strategies, quantization, sampling, speculative decoding, checkpoint loading, and model support.
    • Introduced guides for adding new models, deploying on Kubernetes, and reorganized the documentation index for improved navigation.
    • Included new installation and deployment guides, CLI references, and detailed examples for advanced features.
    • Updated deployment recipes with new Docker image versions and expanded instructions for Llama 3.3-70B and Llama4 Scout models on NVIDIA hardware.
    • Enhanced Quick Start Guide structure and references; added Docker usage instructions for trtllm-server.
    • Refined documentation index and support matrix references for clarity and usability.
    • Added detailed documentation for checkpoint loading system with modular plugin architecture and public APIs.

@nv-guomingz nv-guomingz requested a review from a team as a code owner August 7, 2025 08:38

coderabbitai bot commented Aug 7, 2025

📝 Walkthrough

Walkthrough

This pull request adds a large set of new documentation files and reorganizes the documentation structure for TensorRT-LLM. The changes introduce detailed guides covering performance analysis, benchmarking, advanced features (such as attention, quantization, sampling, speculative decoding, parallel strategies, disaggregated serving, and multi-modality), as well as model support and instructions for adding new models. The documentation index is restructured to improve navigation and visibility of these topics.

Changes

Cohort / File(s) Change Summary
Performance Analysis & Benchmarking Docs
docs/source/developer-guide/perf-analysis.md, docs/source/developer-guide/perf-benchmarking.md
Added comprehensive guides on performance analysis (Nsight Systems, PyTorch profiler) and benchmarking (trtllm-bench CLI), including configuration, dataset preparation, quantization, and multi-modal benchmarking.
Feature Documentation
docs/source/features/attention.md, docs/source/features/disagg-serving.md, docs/source/features/feature-combination-matrix.md, docs/source/features/long-sequence.md, docs/source/features/multi-modality.md, docs/source/features/overlap-scheduler.md, docs/source/features/paged-attention-ifb-scheduler.md, docs/source/features/parallel-strategy.md, docs/source/features/quantization.md, docs/source/features/sampling.md, docs/source/features/speculative-decoding.md
Added new markdown files documenting key features: attention mechanisms, disaggregated serving, feature compatibility matrix, long sequence optimizations, multi-modality, overlap scheduler, paged attention and IFB scheduling, parallel strategies, quantization, sampling, and speculative decoding. Each file provides detailed explanations, usage, configuration, and internal architecture where relevant.
KV Cache and LoRA Documentation
docs/source/features/kvcache.md, docs/source/features/lora.md
Added detailed documentation on KV cache architecture and reuse mechanisms, and comprehensive LoRA feature usage including configuration, integration, and benchmarking.
Model Documentation
docs/source/models/adding-new-model.md, docs/source/models/supported-models.md
Added documentation on supported models (with feature matrix) and a step-by-step guide for adding new models to the PyTorch backend.
Deployment & Example Docs
docs/source/examples/dynamo_k8s_example.rst, docs/source/installation/index.rst, docs/source/deployment-guide/index.rst, docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md, docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md, docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
Added a Kubernetes deployment example for Dynamo, installation section index, and multiple deployment guide recipes for DeepSeek R1, Llama 3.3-70B, and Llama4 Scout models with updated Docker tags and URLs.
Checkpoint Loading Documentation
docs/source/features/checkpoint-loading.md
Added detailed documentation on a modular, plugin-like checkpoint loading architecture with components for loaders, mappers, and configuration parsing, including examples and registration decorators.
Command Documentation
docs/source/commands/trtllm-eval.rst
Added documentation for the trtllm-eval command describing supported evaluation tasks, usage instructions, and examples.
Documentation Structure
docs/source/index.rst
Restructured the documentation index: reorganized, renamed, and hid some sections; introduced new toctrees for features and developer guides; added quick links to external resources.
Miscellaneous Documentation Updates
docs/source/conf.py, docs/source/overview.md, docs/source/quick-start-guide.md, docs/source/reference/support-matrix.md
Added a new admonition for Docker usage in conf.py; swapped references in overview; restructured quick start guide sections and references; removed an unused label anchor in support matrix.

Sequence Diagram(s)

Not applicable: The changes are exclusively documentation additions and restructuring, with no new or modified control flow in code.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • laikhtewari
  • chenopis
  • amukkara
  • litaotju



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🔭 Outside diff range comments (2)
docs/source/features/lora.md (1)

1-2: Placeholder documentation needs real content before 1.0 release

lora.md is a stub with only a single comment. Ship-blocking public docs should, at minimum, state scope, usage and configuration, or be excluded from the published tree until ready.

Happy to draft an outline or migrate internal notes into this file if that helps.

docs/source/features/kvcache.md (1)

1-2: KV-cache doc is still a stub

Same situation as lora.md. Either flesh it out (motivation, supported layouts, tuning flags, links to related scheduler docs) or drop it from the 1.0 docs build to avoid empty pages.

♻️ Duplicate comments (1)
docs/source/features/speculative-decoding.md (1)

84-90: Same import-path inconsistency as above

If the package exposes only one canonical module name, adjust this snippet too.

🧹 Nitpick comments (16)
docs/source/features/overlap-scheduler.md (1)

18-18: Provide a concrete disable example

The text mentions disable_overlap_scheduler=True, but readers will look for a code/CLI snippet. Adding one line improves discoverability:

 Enabled by default. To disable, set `disable_overlap_scheduler=True` in the configuration.
+
+```python
+llm = LLM(model="...", disable_overlap_scheduler=True)
+```
+
+# or via CLI
+```bash
+trtllm-serve --disable-overlap-scheduler ...
+```
docs/source/features/quantization.md (1)

14-18: Minor style tweak: prefer language-tagged code fences

Sphinx’s myst-parser handles ```bash fences fine, but aligning with the rest of the docs, which use triple back-ticks with a language tag, keeps things consistent:

-```bash
+```bash
 # (no functional change, just consistency)
docs/source/examples/dynamo_k8s_example.rst (1)

16-18: Minor wording / formatting nit.

In `DynamoDeploymentGraph yaml`, YAML is an acronym and should be capitalised; also wrap the file name in back-ticks for better rendering.

-You can use the `DynamoDeploymentGraph yaml <...>`_
+You can use the `DynamoDeploymentGraph YAML <...>`_
docs/source/features/feature-combination-matrix.md (1)

10-10: Unexplained acronym “MTP”.

“MTP” appears in both headers and rows without expansion. Add a footnote or parenthetical description so readers can understand the feature without external context.

docs/source/models/supported-models.md (1)

44-45: Footnote text references undefined term “MLA”.

Footnote [^1] says “Chunked Prefill for MLA can only be enabled…”, but MLA is never defined in this document. Spell it out on first use: MLA here refers to Multi-head Latent Attention (as used by DeepSeek models).

docs/source/developer-guide/perf-analysis.md (1)

69-70: Missing period in abbreviation.

In American English, “etc.” requires a period.

-… debugging NVTX markers, etc:
+… debugging NVTX markers, etc.:
docs/source/features/parallel-strategy.md (1)

51-54: Typo in bullet label (“Hybrid ETP”).

Hybrid ETP looks like a mix-up of “Hybrid” and “ETP”. For clarity and searchability, keep the terminology consistent with the earlier description Hybrid (TP × EP).

-* **Hybrid ETP** - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency.
+* **Hybrid (TP × EP)** - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency.
docs/source/features/disagg-serving.md (4)

41-41: Word choice: “rationale” not “rational”.

- for more details about the rational and design considerations
+ for more details about the rationale and design considerations

108-115: Add language tag to fenced code block.

The block is a shell session; without a language tag markdownlint (MD040) fails and syntax highlighting is lost.

-```
+```bash

130-148: Add language tag to YAML snippet.

-```
+```yaml

150-152: Spelling mistake: “serfvers”.

-when routing requests to the generation serfvers,
+when routing requests to the generation servers,
docs/source/features/long-sequence.md (1)

15-21: Incorrect language tag (bash) for Python snippet.

Using the correct tag improves rendering and linters.

-```bash
+```python
docs/source/developer-guide/perf-benchmarking.md (2)

10-13: Grammar: “provides the follows” should be “provides the following”

Minor but jarring in the first paragraph.

-`trtllm-bench` provides the follows:
+`trtllm-bench` provides the following:

459-460: Spelling error: “quantizaton”

-…force the KV cache quantizaton…
+…force the KV-cache quantization…
docs/source/features/speculative-decoding.md (1)

236-238: Typo: “drat tokens”

-…are drat tokens.
+…are draft tokens.
docs/source/features/attention.md (1)

262-264: Non-standard Sphinx role source: – likely renders as plain text

[KVCacheManager](source:…) uses a custom role not defined elsewhere; Sphinx will not resolve it and the link becomes literal text.
Replace with a standard role, e.g. ``:py:class:`tensorrt_llm._torch.pyexecutor.resource_manager.KVCacheManager``` or a plain markdown link.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53f94a4 and b99ca2f.

⛔ Files ignored due to path filters (2)
  • docs/source/features/media/feat_long_seq_chunked_attention.png is excluded by !**/*.png
  • docs/source/features/media/feat_long_seq_sliding_win_attn.png is excluded by !**/*.png
📒 Files selected for processing (20)
  • docs/source/developer-guide/perf-analysis.md (1 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (1 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/attention.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/features/feature-combination-matrix.md (1 hunks)
  • docs/source/features/kvcache.md (1 hunks)
  • docs/source/features/long-sequence.md (1 hunks)
  • docs/source/features/lora.md (1 hunks)
  • docs/source/features/multi-modality.md (1 hunks)
  • docs/source/features/overlap-scheduler.md (1 hunks)
  • docs/source/features/paged-attention-ifb-scheduler.md (1 hunks)
  • docs/source/features/parallel-strategy.md (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/features/sampling.md (1 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
  • docs/source/index.rst (3 hunks)
  • docs/source/installation/index.rst (1 hunks)
  • docs/source/models/adding-new-model.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/examples/dynamo_k8s_example.rst
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/sampling.md
  • docs/source/index.rst
  • docs/source/models/adding-new-model.md
  • docs/source/features/multi-modality.md
  • docs/source/features/parallel-strategy.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/features/quantization.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/sampling.md
  • docs/source/index.rst
  • docs/source/models/supported-models.md
  • docs/source/models/adding-new-model.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/multi-modality.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/parallel-strategy.md
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/long-sequence.md
  • docs/source/index.rst
  • docs/source/features/lora.md
  • docs/source/models/adding-new-model.md
  • docs/source/features/multi-modality.md
  • docs/source/features/attention.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/parallel-strategy.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/features/long-sequence.md
  • docs/source/features/sampling.md
  • docs/source/features/multi-modality.md
  • docs/source/features/parallel-strategy.md
  • docs/source/features/paged-attention-ifb-scheduler.md
  • docs/source/features/speculative-decoding.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/features/disagg-serving.md
🪛 LanguageTool
docs/source/developer-guide/perf-analysis.md

[style] ~69-~69: In American English, abbreviations like “etc.” require a period.
Context: ...g, such as GIL, debugging NVTX markers, etc: ```bash #!/bin/bash # Prepare datase...

(ETC_PERIOD)

docs/source/features/disagg-serving.md

[grammar] ~151-~151: Ensure spelling is correct
Context: ...when routing requests to the generation serfvers, the disaggregated server will mark the...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~202-~202: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. * `TRTLLM_KVCACHE_SEND_MAX_CONCURRE...

(MORE_THAN_EXCEEDS)


[style] ~204-~204: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. There are some other useful environ...

(MORE_THAN_EXCEEDS)


[style] ~208-~208: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...

(MAKE_SURE_ENSURE)

docs/source/features/multi-modality.md

[style] ~3-~3: Consider using a more concise synonym.
Context: ... in TensorRT-LLM TensorRT-LLM supports a variety of multimodal models, enabling efficient i...

(A_VARIETY_OF)

docs/source/features/attention.md

[style] ~68-~68: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...osition of each token in each sequence. May be None if positional embedding is appl...

(MISSING_IT_THERE)


[style] ~68-~68: This phrase is redundant. Consider using “outside”.
Context: ...None if positional embedding is applied outside of the backend. | | request_ids | List[int...

(OUTSIDE_OF)


[style] ~131-~131: ‘in proportion to’ might be wordy. Consider a shorter alternative.
Context: ...int is significant (grows quadratically in proportion to the sequence length). The `TrtllmAtten...

(EN_WORDINESS_PREMIUM_IN_PROPORTION_TO)


[style] ~158-~158: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... attention in TensorRT-LLM. That kernel is able to apply pre-processing on the Q, K, and V...

(BE_ABLE_TO)


[style] ~168-~168: To elevate your writing, try using a synonym here.
Context: ...efinition of 'small' in that context is hard to quantify because it depends on the m...

(HARD_TO)


[style] ~254-~254: Consider using a synonym to be more concise.
Context: ...tion of many output tokens, it may take a lot of steps to reach that point). #### Paged...

(A_LOT_OF)


[style] ~312-~312: Consider removing “of” to be more concise
Context: ...implementation of StreamingLLM. It uses all of the tokens for computation and only saves `...

(ALL_OF_THE)

docs/source/developer-guide/perf-benchmarking.md

[style] ~13-~13: Consider using a more concise synonym.
Context: ...uild tuned engines for benchmarking for a variety of models and platforms. - An entirely Pyt...

(A_VARIETY_OF)


[style] ~147-~147: Consider a more concise word here.
Context: ...plete JSON entry is on every line. ``` In order to prepare a synthetic dataset, you can us...

(IN_ORDER_TO_PREMIUM)


[style] ~325-~325: ‘due to the fact that’ might be wordy. Consider a shorter alternative.
Context: ...re texts and not tokenized ids. This is due to the fact that the prompt and the media (image/video...

(EN_WORDINESS_PREMIUM_DUE_TO_THE_FACT_THAT)


[grammar] ~459-~459: Ensure spelling is correct
Context: ...If you would like to force the KV cache quantizaton, you can specify the following in the Y...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/features/paged-attention-ifb-scheduler.md

[style] ~123-~123: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...x num token budget of 12, the scheduler is able to schedule two additional requests, Reque...

(BE_ABLE_TO)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...support. For long-context cases this is extremely important, because setting extremely large values...

(EN_WEAK_ADJECTIVE)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...is extremely important, because setting extremely large values of max-num tokens takes away fro...

(EN_WEAK_ADJECTIVE)

docs/source/features/speculative-decoding.md

[style] ~11-~11: Consider removing “of” to be more concise
Context: ...lementation. The table below enumerates all of the algorithm/model combinations that are s...

(ALL_OF_THE)


[style] ~94-~94: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s currently only supported by Deepseek. MTP can be tuned with the following configu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~149-~149: Consider removing “of” to be more concise
Context: ...ts of a speculative decoding algorithm. All of the interfaces are defined in [`_torch/spec...

(ALL_OF_THE)


[style] ~161-~161: Consider removing “of” to be more concise
Context: ... is problematic. Ideally, we would move all of the kernel dispatching logic to a lower lev...

(ALL_OF_THE)


[style] ~187-~187: Consider removing “of” to be more concise
Context: ...et model as a submodule. 7. Drafter: All of the logic required to actually produce draf...

(ALL_OF_THE)

🪛 markdownlint-cli2 (0.17.2)
docs/source/developer-guide/perf-analysis.md

13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


15-15: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


22-22: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


23-23: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

docs/source/features/disagg-serving.md

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


141-141: Link fragments should be valid

(MD051, link-fragments)

docs/source/features/overlap-scheduler.md

3-3: Link fragments should be valid
Expected: #motivation; Actual: #Motivation

(MD051, link-fragments)


4-4: Link fragments should be valid
Expected: #kv-cache-exchange; Actual: #KV-Cache-Exchange

(MD051, link-fragments)


5-5: Link fragments should be valid
Expected: #multi-backend-support; Actual: #Multi-backend-Support

(MD051, link-fragments)


6-6: Link fragments should be valid
Expected: #overlap-optimization; Actual: #Overlap-Optimization

(MD051, link-fragments)


7-7: Link fragments should be valid
Expected: #cache-layout-transformation; Actual: #Cache-Layout-Transformation

(MD051, link-fragments)


8-8: Link fragments should be valid
Expected: #usage; Actual: #Usage

(MD051, link-fragments)


10-10: Link fragments should be valid
Expected: #dynamo; Actual: #Dynamo

(MD051, link-fragments)


11-11: Link fragments should be valid
Expected: #environment-variables; Actual: #Environment-Variables

(MD051, link-fragments)


12-12: Link fragments should be valid
Expected: #troubleshooting-and-faq; Actual: #Troubleshooting-and-FAQ

(MD051, link-fragments)


20-20: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


21-21: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

docs/source/features/speculative-decoding.md

33-33: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


34-34: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


55-55: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


56-56: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


77-77: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


79-79: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


80-80: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


81-81: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


96-96: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


97-97: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


98-98: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


99-99: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


115-115: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


116-116: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


117-117: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


132-132: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


133-133: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


134-134: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


135-135: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


139-139: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


194-194: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


195-195: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


198-198: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


199-199: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
docs/source/installation/index.rst (1)

6-12: Verify that the referenced sub-pages actually exist

toctree links (linux, build-from-source-linux, containers) will generate Sphinx warnings if the target .rst files are missing or mis-named (e.g., .md vs .rst). Please confirm file presence or adjust paths.

docs/source/features/quantization.md (1)

6-10: Double-check that the demo model actually exists on Hugging Face

nvidia/Llama-3.1-8B-Instruct-FP8 is used as an example. If the card is private or not yet pushed, users will hit a 404. Either switch to a public model (e.g., nvidia/Llama-3-8B-Instruct-FP8) or annotate that it is placeholder text.

docs/source/features/multi-modality.md (1)

35-39: Path in quick-start snippet likely incorrect.

The example points to examples/llm-api/quickstart_multimodal.py, whereas all existing examples use an underscore (llm_api) rather than a hyphen.
Double-check the actual path in the repo (should be examples/llm_api/quickstart_multimodal.py) to avoid a “file not found” experience for users.

docs/source/features/feature-combination-matrix.md (1)

5-6: Matrix row missing compatibility data.

Row “Overlap Scheduler” contains only “---” in the diagonal but leaves every other column blank. This renders as empty cells that users may interpret as “unknown”, whereas other rows use explicit “Yes/No/Untested”.
Populate the row or mark the cells as Untested to stay consistent.

docs/source/models/supported-models.md (1)

17-18: Verify model class spelling.

MllamaForConditionalGeneration looks like a typo (double-l). In the codebase the class is MLlamaForConditionalGeneration. Please confirm and correct to avoid confusing users.

Comment on lines +72 to +88
#!/bin/bash

# Prepare dataset for the benchmark
python3 benchmarks/cpp/prepare_dataset.py \
--tokenizer=${MODEL_PATH} \
--stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
--input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt

# Benchmark and profile
TLLM_PROFILE_START_STOP=100-150 nsys profile \
-o trace -f true \
-t 'cuda,nvtx,python-gil' -c cudaProfilerApi \
--cuda-graph-trace node \
-e TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
--trace-fork-before-exec=true \
trtllm-bench \ # or trtllm-serve command
--model deepseek-ai/DeepSeek-V3 \

⚠️ Potential issue

Example command contains an escaped newline that breaks copy-paste.

The back-slash after trtllm-bench escapes the following space, so # or trtllm-serve command becomes part of the command and causes a parse error.

Consider either removing the comment or placing it on its own line:

-  trtllm-bench \ # or trtllm-serve command
+  # Replace with `trtllm-serve` if desired
+  trtllm-bench \
🤖 Prompt for AI Agents
In docs/source/developer-guide/perf-analysis.md around lines 72 to 88, the
example bash command includes a backslash after 'trtllm-bench' that escapes the
following space, causing the comment '# or trtllm-serve command' to be
interpreted as part of the command and resulting in a parse error. To fix this,
remove the backslash after 'trtllm-bench' or move the comment to its own
separate line above or below the command to avoid breaking the command syntax.

Comment on lines 4 to 7
.. .. toctree::
.. :maxdepth: 2
.. :caption: Scripts


⚠️ Potential issue

Malformed toctree directive.

The directive currently starts with .. .. toctree::; the extra .. causes Sphinx to treat the entire block as a literal, so no toctree is generated.

-.. .. toctree::
+.. toctree::

Indent the subsequent options two spaces under the directive per reStructuredText rules.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In docs/source/examples/dynamo_k8s_example.rst around lines 4 to 7, the toctree
directive is malformed because it starts with two sets of `..`, causing Sphinx
to treat it as a literal block. Remove the extra `..` so the directive starts
with a single `.. toctree::` and ensure the following options are indented
exactly two spaces under the directive to comply with reStructuredText syntax.

Comment on lines +25 to +28
## Model Support Matrix

Please refer to the latest multimodal [support matrix](/docs/source/reference/support-matrix.md).


🛠️ Refactor suggestion

Broken or brittle link to support-matrix.

The relative URL (/docs/source/reference/support-matrix.md) will break once the docs are rendered by Sphinx because links should be expressed via :doc: (internal) or fully qualified external URLs.
Replace with an internal cross-reference, e.g.:

-Please refer to the latest multimodal [support matrix](/docs/source/reference/support-matrix.md).
+Please refer to the latest multimodal :doc:`support matrix <reference/support-matrix>`.
🤖 Prompt for AI Agents
In docs/source/features/multi-modality.md around lines 25 to 28, the link to the
support matrix uses a relative URL that will break when rendered by Sphinx.
Replace the relative URL with a Sphinx internal cross-reference using the :doc:
role, for example :doc:`reference/support-matrix`, to ensure the link works
correctly in the generated documentation.

Comment on lines +76 to +79
constraints on input length. To enable this feature, the FMHA paged kv-cache also
needs to be enabled. Except for the last chunk, the size of each context chunk needs to be an integer multiple of the kv-cache block size. Refer to
[the performance best practices](../performance/perf-best-practices.md#chunked-context) for usage.


🛠️ Refactor suggestion

Out-of-tree link likely to break (../performance/perf-best-practices.md).

The performance/ section was removed in this restructure; use the new path under developer-guide/ or update the anchor.

🤖 Prompt for AI Agents
In docs/source/features/paged-attention-ifb-scheduler.md around lines 76 to 79,
the link to the performance best practices file uses an outdated relative path
../performance/perf-best-practices.md which no longer exists. Update this link
to point to the new location under developer-guide/ and verify or update the
anchor to match the current section heading in the new file.

Comment on lines +74 to +80
To enable guided decoding, you must:

1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
2. Create a [`GuidedDecodingParams`](../../../../tensorrt_llm/sampling_params.py#L14) object with the desired format specification
* Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `strucutral_tag`).
3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object


⚠️ Potential issue

Typo in parameter name (strucutral_tag) breaks guided-decoding docs

Line 78 misspells structural_tag, which is the keyword accepted by GuidedDecodingParams. Readers copying the snippet will hit TypeError: got an unexpected keyword argument.

-    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `strucutral_tag`).
+    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
🤖 Prompt for AI Agents
In docs/source/features/sampling.md around lines 74 to 80, the parameter name
`strucutral_tag` is misspelled and should be corrected to `structural_tag` to
match the expected keyword argument in GuidedDecodingParams. Update the
documentation to fix this typo to prevent users from encountering a TypeError
when using the guided decoding feature.

Comment on lines +40 to +43
```python
from tensorrt_llm.llmapi import DraftTargetDecodingConfig

speculative_config = DraftTargetDecodingConfig(

💡 Verification agent

🧩 Analysis chain

Inconsistent import path (llmapi vs llm_api)

Here you use tensorrt_llm.llmapi, but later examples import from tensorrt_llm.llm_api. Only one exists in the package. Pick one consistently or users will copy-paste code that fails to import.

Run:


🏁 Script executed:

#!/bin/bash
# Verify which sub-module actually exists
fd -t f __init__.py | grep -E "llmapi|llm_api"

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Search for any imports using the incorrect llm_api path in documentation
rg -n "tensorrt_llm\.llm_api" docs/source

Length of output: 353


Ensure consistent use of tensorrt_llm.llmapi in speculative-decoding documentation

The repository defines the module as tensorrt_llm.llmapi; there is no llm_api package. Please update all imports in docs/source/features/speculative-decoding.md so they match the existing path:

• Lines 40–43 (already correct):

from tensorrt_llm.llmapi import DraftTargetDecodingConfig

• Line 84:

-from tensorrt_llm.llm_api import NGramDecodingConfig
+from tensorrt_llm.llmapi import NGramDecodingConfig

• Line 105:

-from tensorrt_llm.llm_api import MTPDecodingConfig
+from tensorrt_llm.llmapi import MTPDecodingConfig

• Line 120:

-from tensorrt_llm.llm_api import UserProvidedDecodingConfig
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig

This will prevent import errors when users copy-paste these examples.

🤖 Prompt for AI Agents
In docs/source/features/speculative-decoding.md at lines 84, 105, and 120,
update all import statements that currently use 'tensorrt_llm.llm_api' to use
the correct module path 'tensorrt_llm.llmapi' to ensure consistency with the
repository's defined module and prevent import errors.

Comment on lines +74 to +76
features/overlap-scheduler.md
features/paged-attention-ifb-scheduling.md
features/parallel-strategy.md

⚠️ Potential issue

Broken toctree link: file name mismatch (paged-attention-ifb-scheduling.md).

The toctree entry points to features/paged-attention-ifb-scheduling.md, but the actual file added in this PR is features/paged-attention-ifb-scheduler.md.
Sphinx will raise a “file not found” error during the docs build.

-   features/paged-attention-ifb-scheduling.md
+   features/paged-attention-ifb-scheduler.md
🤖 Prompt for AI Agents
In docs/source/index.rst around lines 74 to 76, the toctree entry references the
file features/paged-attention-ifb-scheduling.md, but the actual file name is
features/paged-attention-ifb-scheduler.md. Update the toctree entry to use the
correct file name features/paged-attention-ifb-scheduler.md to fix the broken
link and prevent Sphinx build errors.

Comment on lines +174 to +181

```python
from .modeling_mymodel import MyModelForCausalLM

__all__ = [
...,
"MyModelForCausalLM",
]

🛠️ Refactor suggestion

__all__ update is incomplete – makes the import silently fail

If MyModelForCausalLM is added to __all__ but MyModel and MyConfig are not, wildcard imports (from …models import *) will expose an incomplete surface and break user code following the earlier examples in this doc. Add all public symbols of the new model.

🤖 Prompt for AI Agents
In docs/source/models/adding-new-model.md around lines 174 to 181, the __all__
list includes "MyModelForCausalLM" but omits other public symbols like "MyModel"
and "MyConfig". To fix this, update the __all__ list to include all public
symbols related to the new model, ensuring wildcard imports expose the complete
API surface as intended.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (6)
docs/source/features/multi-modality.md (1)

25-28: Replace raw Markdown URL with Sphinx :doc: cross-reference.

[support matrix](/docs/source/reference/support-matrix.md) will break once rendered because Sphinx does not resolve absolute GitHub paths.
Use an internal cross-reference instead:

-Please refer to the latest multimodal [support matrix](/docs/source/reference/support-matrix.md).
+Please refer to the latest multimodal :doc:`support matrix <reference/support-matrix>`.
docs/source/developer-guide/perf-analysis.md (1)

81-88: Escaped newline still swallows the inline comment.

The back-slash before the comment causes “# or trtllm-serve command” to be parsed as part of trtllm-bench and breaks the example.

-  trtllm-bench \ # or trtllm-serve command
+# Replace with `trtllm-serve` if desired
+  trtllm-bench \
docs/source/index.rst (1)

74-76: Broken toctree link – filename mismatch.

Entry points to features/paged-attention-ifb-scheduling.md, but the file added is features/paged-attention-ifb-scheduler.md. This will fail the Sphinx build.

-   features/paged-attention-ifb-scheduling.md
+   features/paged-attention-ifb-scheduler.md
docs/source/features/paged-attention-ifb-scheduler.md (1)

139-140: Dead reference to removed file useful-build-time-flags.md.

The linked file was deleted in the restructure, leading to a 404. Either:

  1. Point to an equivalent section in another doc, or
  2. Remove the hyperlink entirely.
docs/source/features/sampling.md (1)

78-78: Fix typo in guided-decoding parameter list

`strucutral_tag` → `structural_tag`.
Without the correction, users copy-pasting the snippet will hit TypeError: got an unexpected keyword argument.

-    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `strucutral_tag`).
+    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
docs/source/features/speculative-decoding.md (1)

84-84: Import path still inconsistent (llm_api vs llmapi)

The package exposes tensorrt_llm.llmapi; tensorrt_llm.llm_api does not exist. Snippets will fail at import time.

-from tensorrt_llm.llm_api import NGramDecodingConfig
+from tensorrt_llm.llmapi import NGramDecodingConfig

(Apply the same fix on Lines 105 and 120.)

Also applies to: 105-105, 120-120

🧹 Nitpick comments (9)
docs/source/features/multi-modality.md (1)

35-37: Prefer :doc: for intra-repo example links.

The quick-start hyperlink hard-codes a GitHub path. Relying on absolute paths breaks local builds and future directory moves. Replace with a Sphinx doc role or a relative link to keep the docs portable.
No blocking issue, but worth tightening before the 1.0 cut-off.

docs/source/features/lora.md (1)

157-175: YAML snippet: preserve indentation for correct rendering.

Sphinx‐Markdown sometimes collapses leading spaces in fenced YAML blocks, which can cause copy-paste errors. Add yaml after the opening back-tick fence so the highlighter preserves indentation:

-```
+```yaml

(the same change applies to the second YAML block at lines 198-217)

docs/source/developer-guide/perf-analysis.md (1)

12-24: List indentation breaks markdownlint rules & Sphinx bullet rendering.

The nested bullets are indented by two spaces instead of zero/four, triggering MD007 and causing mis-rendered bullets in HTML. Adjust to standard 0/2/4 indentation.

docs/source/features/paged-attention-ifb-scheduler.md (1)

145-145: Tone: remove stacked ‘extremely’ intensifiers.

Over-using intensifiers weakens technical prose. Consider replacing “extremely beneficial” / “extremely important” with a single, stronger adjective or drop the adverb.

docs/source/features/disagg-serving.md (2)

41-41: Typo affects searchability (“rationale”)

“rational” (adjective) should be the noun “rationale” in this context.

-...for more details about the rational and design considerations of disaggregated serving.
+...for more details about the rationale and design considerations of disaggregated serving.

151-151: Correct misspelling of “servers”

`generation serfvers` → `generation servers`.

-when routing requests to the generation serfvers, the disaggregated server will mark the requests as "generation-only" to skip the context phase.
+when routing requests to the generation servers, the disaggregated server will mark the requests as "generation-only" to skip the context phase.
docs/source/features/speculative-decoding.md (1)

237-237: Minor typo: “drat tokens”

`drat` → `draft`.

-`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are drat tokens.
+`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are draft tokens.
docs/source/developer-guide/perf-benchmarking.md (1)

459-459: Typo: “quantizaton” → “quantization”

-If you would like to force the KV cache quantizaton, you can specify the following in the YAML file to force the precision
+If you would like to force the KV cache quantization, you can specify the following in the YAML file to force the precision
docs/source/features/attention.md (1)

65-69: Minor wording – drop “outside of”

Consider the slimmer wording; it appears repeatedly in the doc.

-| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside of the backend. |
+| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside the backend. |
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b99ca2f and 33a6c4e.

⛔ Files ignored due to path filters (2)
  • docs/source/features/media/feat_long_seq_chunked_attention.png is excluded by !**/*.png
  • docs/source/features/media/feat_long_seq_sliding_win_attn.png is excluded by !**/*.png
📒 Files selected for processing (20)
  • docs/source/developer-guide/perf-analysis.md (1 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (1 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/attention.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/features/feature-combination-matrix.md (1 hunks)
  • docs/source/features/kvcache.md (1 hunks)
  • docs/source/features/long-sequence.md (1 hunks)
  • docs/source/features/lora.md (1 hunks)
  • docs/source/features/multi-modality.md (1 hunks)
  • docs/source/features/overlap-scheduler.md (1 hunks)
  • docs/source/features/paged-attention-ifb-scheduler.md (1 hunks)
  • docs/source/features/parallel-strategy.md (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/features/sampling.md (1 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
  • docs/source/index.rst (3 hunks)
  • docs/source/installation/index.rst (1 hunks)
  • docs/source/models/adding-new-model.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
✅ Files skipped from review due to trivial changes (5)
  • docs/source/features/overlap-scheduler.md
  • docs/source/features/quantization.md
  • docs/source/examples/dynamo_k8s_example.rst
  • docs/source/features/feature-combination-matrix.md
  • docs/source/features/long-sequence.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • docs/source/features/parallel-strategy.md
  • docs/source/models/adding-new-model.md
  • docs/source/features/kvcache.md
  • docs/source/installation/index.rst
  • docs/source/models/supported-models.md
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/features/lora.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/features/lora.md
  • docs/source/index.rst
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/features/lora.md
  • docs/source/index.rst
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/features/lora.md
  • docs/source/index.rst
  • docs/source/features/attention.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/multi-modality.md
  • docs/source/features/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/features/multi-modality.md
  • docs/source/features/paged-attention-ifb-scheduler.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
🪛 LanguageTool
docs/source/features/attention.md

[style] ~68-~68: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...osition of each token in each sequence. May be None if positional embedding is appl...

(MISSING_IT_THERE)


[style] ~68-~68: This phrase is redundant. Consider using “outside”.
Context: ...None if positional embedding is applied outside of the backend. | | request_ids | List[int...

(OUTSIDE_OF)


[style] ~131-~131: ‘in proportion to’ might be wordy. Consider a shorter alternative.
Context: ...int is significant (grows quadratically in proportion to the sequence length). The `TrtllmAtten...

(EN_WORDINESS_PREMIUM_IN_PROPORTION_TO)


[style] ~158-~158: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... attention in TensorRT-LLM. That kernel is able to apply pre-processing on the Q, K, and V...

(BE_ABLE_TO)


[style] ~168-~168: To elevate your writing, try using a synonym here.
Context: ...efinition of 'small' in that context is hard to quantify because it depends on the m...

(HARD_TO)


[style] ~254-~254: Consider using a synonym to be more concise.
Context: ...tion of many output tokens, it may take a lot of steps to reach that point). #### Paged...

(A_LOT_OF)


[style] ~312-~312: Consider removing “of” to be more concise
Context: ...implementation of StreamingLLM. It uses all of the tokens for computation and only saves `...

(ALL_OF_THE)

docs/source/developer-guide/perf-analysis.md

[style] ~69-~69: In American English, abbreviations like “etc.” require a period.
Context: ...g, such as GIL, debugging NVTX markers, etc: ```bash #!/bin/bash # Prepare datase...

(ETC_PERIOD)

docs/source/developer-guide/perf-benchmarking.md

[style] ~13-~13: Consider using a more concise synonym.
Context: ...uild tuned engines for benchmarking for a variety of models and platforms. - An entirely Pyt...

(A_VARIETY_OF)


[style] ~147-~147: Consider a more concise word here.
Context: ...plete JSON entry is on every line. ``` In order to prepare a synthetic dataset, you can us...

(IN_ORDER_TO_PREMIUM)


[style] ~325-~325: ‘due to the fact that’ might be wordy. Consider a shorter alternative.
Context: ...re texts and not tokenized ids. This is due to the fact that the prompt and the media (image/video...

(EN_WORDINESS_PREMIUM_DUE_TO_THE_FACT_THAT)


[grammar] ~459-~459: Ensure spelling is correct
Context: ...If you would like to force the KV cache quantizaton, you can specify the following in the Y...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/features/disagg-serving.md

[style] ~202-~202: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. * `TRTLLM_KVCACHE_SEND_MAX_CONCURRE...

(MORE_THAN_EXCEEDS)


[style] ~204-~204: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. There are some other useful environ...

(MORE_THAN_EXCEEDS)


[style] ~208-~208: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...

(MAKE_SURE_ENSURE)

docs/source/features/multi-modality.md

[style] ~3-~3: Consider using a more concise synonym.
Context: ... in TensorRT-LLM TensorRT-LLM supports a variety of multimodal models, enabling efficient i...

(A_VARIETY_OF)

docs/source/features/paged-attention-ifb-scheduler.md

[style] ~123-~123: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...x num token budget of 12, the scheduler is able to schedule two additional requests, Reque...

(BE_ABLE_TO)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...support. For long-context cases this is extremely important, because setting extremely large values...

(EN_WEAK_ADJECTIVE)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...is extremely important, because setting extremely large values of max-num tokens takes away fro...

(EN_WEAK_ADJECTIVE)

docs/source/features/speculative-decoding.md

[style] ~11-~11: Consider removing “of” to be more concise
Context: ...lementation. The table below enumerates all of the algorithm/model combinations that are s...

(ALL_OF_THE)


[style] ~94-~94: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s currently only supported by Deepseek. MTP can be tuned with the following configu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~149-~149: Consider removing “of” to be more concise
Context: ...ts of a speculative decoding algorithm. All of the interfaces are defined in [`_torch/spec...

(ALL_OF_THE)


[style] ~161-~161: Consider removing “of” to be more concise
Context: ... is problematic. Ideally, we would move all of the kernel dispatching logic to a lower lev...

(ALL_OF_THE)


[style] ~187-~187: Consider removing “of” to be more concise
Context: ...et model as a submodule. 7. Drafter: All of the logic required to actually produce draf...

(ALL_OF_THE)

🪛 markdownlint-cli2 (0.17.2)
docs/source/features/attention.md

24-24: Bare URL used

(MD034, no-bare-urls)

docs/source/developer-guide/perf-analysis.md

13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


15-15: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


22-22: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


23-23: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

docs/source/features/sampling.md

78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

docs/source/features/speculative-decoding.md

33-33: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


34-34: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


55-55: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


56-56: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


77-77: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


79-79: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


80-80: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


81-81: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


96-96: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


97-97: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


98-98: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


99-99: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


115-115: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


116-116: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


117-117: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


132-132: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


133-133: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


134-134: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


135-135: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


139-139: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


194-194: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


195-195: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


198-198: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


199-199: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/features/paged-attention-ifb-scheduler.md (1)

76-80: Stale link after docs restructure.

../performance/perf-best-practices.md no longer exists. Update to the new path under developer-guide/ (or drop the link if the target was removed).

Would you confirm the new location of “Performance Best Practices” so I can propose an exact fix?

Comment on lines +113 to +118
| Key | Required | Type | Description |
| :-------------- | :------: | :-----------: | :---------------------------------------------- |
| `task_id` | Y | String | Unique identifier for the request. |
| `prompt` | N* | String | Input text for a generation request. |
| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. |
| `output_tokens` | Y | Integer | Number of generated tokens for this request. |

🛠️ Refactor suggestion

Incorrect field description – input_ids are token IDs, not logits

The table claims that input_ids is a “List of logits”. That is misleading and will confuse users preparing datasets.

-| `input_ids`     |    Y*    | List[Integer] | List of logits that make up the request prompt. |
+| `input_ids`     |    Y*    | List[Integer] | Token IDs that make up the request prompt.      |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| Key | Required | Type | Description |
| :-------------- | :------: | :-----------: | :---------------------------------------------- |
| `task_id` | Y | String | Unique identifier for the request. |
| `prompt` | N* | String | Input text for a generation request. |
| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. |
| `output_tokens` | Y | Integer | Number of generated tokens for this request. |
| Key | Required | Type | Description |
| :-------------- | :------: | :-----------: | :---------------------------------------------- |
| `task_id` | Y | String | Unique identifier for the request. |
| `prompt` | N* | String | Input text for a generation request. |
| `input_ids` | Y* | List[Integer] | Token IDs that make up the request prompt. |
| `output_tokens` | Y | Integer | Number of generated tokens for this request. |
🤖 Prompt for AI Agents
In docs/source/developer-guide/perf-benchmarking.md around lines 113 to 118, the
description for the `input_ids` field incorrectly states it as a "List of
logits." Update this description to correctly state that `input_ids` is a list
of token IDs, not logits, to accurately reflect its purpose and avoid user
confusion.
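
For readers preparing datasets by hand, a minimal sketch of one record that follows the corrected field semantics (`input_ids` as token IDs, not logits). Field names are taken from the table above; the concrete values and output file name are illustrative only:

```python
# Hedged sketch: one synthetic dataset record per line (one complete JSON entry per line).
import json

record = {
    "task_id": "req-0001",                # unique identifier for the request
    "input_ids": [101, 2023, 2003, 102],  # prompt expressed as token IDs, not logits
    "output_tokens": 128,                 # number of tokens to generate for this request
    # "prompt" may be supplied instead of "input_ids" for text-based requests.
}

with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```
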

Comment on lines +262 to +263
blocks when required. See the implementation of
[`KVCacheManager`](source:tensorrt_llm/_torch/pyexecutor/resource_manager.py).

🛠️ Refactor suggestion

Invalid Sphinx link syntax – use a resolvable reference

The custom source: scheme will not be resolved by Sphinx and breaks the build.

-[`KVCacheManager`](source:tensorrt_llm/_torch/pyexecutor/resource_manager.py).
+See :py:class:`tensorrt_llm._torch.pyexecutor.resource_manager.KVCacheManager` for implementation details.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
blocks when required. See the implementation of
[`KVCacheManager`](source:tensorrt_llm/_torch/pyexecutor/resource_manager.py).
blocks when required. See the implementation of
See :py:class:`tensorrt_llm._torch.pyexecutor.resource_manager.KVCacheManager` for implementation details.
🤖 Prompt for AI Agents
In docs/source/features/attention.md around lines 262 to 263, the link to
KVCacheManager uses an invalid custom source: scheme that Sphinx cannot resolve,
causing build errors. Replace the custom source: link with a proper
Sphinx-compatible reference, such as a relative or absolute path link or a
cross-reference directive, ensuring the link points correctly to the
resource_manager.py file or the KVCacheManager documentation.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (12)
docs/source/features/speculative-decoding.md (3)

84-84: Incorrect import path – should use tensorrt_llm.llmapi.
The module is spelled llmapi (no underscore) everywhere else in the code-base. Leaving this as-is will break copy-paste examples.

-from tensorrt_llm.llm_api import NGramDecodingConfig
+from tensorrt_llm.llmapi import NGramDecodingConfig

105-105: Same wrong module name as above.
Fix to maintain consistency and avoid import errors.

-from tensorrt_llm.llm_api import MTPDecodingConfig
+from tensorrt_llm.llmapi import MTPDecodingConfig

120-120: Third occurrence of the wrong import path.
Please correct.

-from tensorrt_llm.llm_api import UserProvidedDecodingConfig
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig
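
Taken together, the three fixes above point at the same module. A minimal sketch of the corrected imports; that all three decoding configs are exported from `tensorrt_llm.llmapi` is carried over from the documentation under review, not verified here:

```python
# Corrected module path per the review comments above: `llmapi`, no underscore.
from tensorrt_llm.llmapi import (
    MTPDecodingConfig,
    NGramDecodingConfig,
    UserProvidedDecodingConfig,
)
```
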
docs/source/features/multi-modality.md (1)

25-28: Broken link to support matrix – use Sphinx :doc:.
Relative URLs are not resolved in the HTML build; this was already raised previously.

-Please refer to the latest multimodal [support matrix](/docs/source/reference/support-matrix.md).
+Please refer to the latest multimodal :doc:`support matrix <reference/support-matrix>`.
docs/source/developer-guide/perf-analysis.md (1)

87-88: Escaped newline still breaks the command.
The backslash escapes the space that follows it rather than the newline, so the command no longer continues onto the next line and “# or trtllm-serve command” breaks the invocation. Move the comment to its own line or drop the backslash.

-  trtllm-bench \ # or trtllm-serve command
+# Replace with `trtllm-serve` if desired
+  trtllm-bench \
docs/source/index.rst (1)

69-70: Broken toctree entry – filename mismatch (scheduling vs scheduler).
The file added in this PR is paged-attention-ifb-scheduler.md, but the toctree points to paged-attention-ifb-scheduling.md. Sphinx will fail to build.

-   features/paged-attention-ifb-scheduling.md
+   features/paged-attention-ifb-scheduler.md
docs/source/features/paged-attention-ifb-scheduler.md (2)

76-79: Broken relative link to performance best-practices
The path ../performance/perf-best-practices.md no longer exists after the doc restructure; it will 404.

Update to the new location under developer-guide/ (e.g. ../../developer-guide/perf-best-practices.md) and verify the anchor.


137-141: Stale reference to useful-build-time-flags.md

useful-build-time-flags.md was removed; this hyperlink will break. Replace with a valid target or drop the link.

docs/source/developer-guide/perf-benchmarking.md (2)

17-19: Missing URL for [in-flight batching] link

The link renders as plain text. Add the destination, e.g.:

-`trtllm-bench` executes all benchmarks using [in-flight batching] --
+`trtllm-bench` executes all benchmarks using [in-flight batching](../advanced/gpt-attention.md#in-flight-batching) --

116-118: Table entry incorrectly says “List of logits”

input_ids are token IDs, not logits. Replace description with “Token IDs that make up the request prompt.”

docs/source/features/attention.md (1)

260-264: Duplicate: invalid source: link still breaks the Sphinx build
The custom source: scheme is unresolved by Sphinx; replace with a proper cross-reference (e.g. :py:class:).
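
As a quick sanity check that the dotted path the suggested `:py:class:` role points at is importable (path taken from the file referenced in the doc):

```python
# Hedged: mirrors the module path tensorrt_llm/_torch/pyexecutor/resource_manager.py
# used by the suggested cross-reference.
from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
```
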

docs/source/models/adding-new-model.md (1)

175-181: Expose all public symbols in __all__
Only MyModelForCausalLM is exported, so wildcard imports will omit MyConfig and MyModel, breaking example code that relies on them. Add the missing symbols.

 __all__ = [
     ...,
+    "MyConfig",
+    "MyModel",
     "MyModelForCausalLM",
 ]
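
A self-contained toy (not TensorRT-LLM code) showing why the missing entries matter: a wildcard import only picks up the names listed in `__all__`.

```python
import sys
import types

# Build a throwaway module that mimics the example: three public classes,
# but only one of them listed in __all__.
toy = types.ModuleType("toy_models")
exec(
    "class MyConfig: pass\n"
    "class MyModel: pass\n"
    "class MyModelForCausalLM: pass\n"
    "__all__ = ['MyModelForCausalLM']\n",
    toy.__dict__,
)
sys.modules["toy_models"] = toy

ns = {}
exec("from toy_models import *", ns)

assert "MyModelForCausalLM" in ns
# MyConfig and MyModel are silently dropped until they are added to __all__.
assert "MyConfig" not in ns and "MyModel" not in ns
```
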
🧹 Nitpick comments (10)
docs/source/features/speculative-decoding.md (1)

236-239: Typo: “drat tokens” → “draft tokens”.
Minor copy edit to prevent confusion.

-then `d2` can be accepted. And so on until draft tokens cannot be accepted anymore.
+then `d2` can be accepted. And so on until draft tokens can no longer be accepted.
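
For context on the sentence being edited, a toy sketch of the sequential acceptance it describes: draft tokens are compared against the target model's tokens in order and acceptance stops at the first mismatch (the actual acceptance criterion used by the implementation may differ):

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Accept draft tokens one by one while they agree with the target
    model's tokens; stop at the first disagreement."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted

# d1 and d2 match, d3 does not, so only the first two drafts are accepted.
assert accept_draft_tokens([11, 42, 7], [11, 42, 99]) == [11, 42]
```
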
docs/source/features/multi-modality.md (1)

35-37: Use Sphinx roles for internal example links.
Raw Git-style paths break once docs are moved. Prefer :doc: or :ref: to keep links portable.

-Quickly try out TensorRT-LLM's multimodal support using our `LLM-API` and a ready-to-run [example](/examples/llm-api/quickstart_multimodal.py):
+Quickly try out TensorRT-LLM's multimodal support using our `LLM-API` and the ready-to-run :doc:`example <../../examples/llm-api/quickstart_multimodal.py>`:
docs/source/features/long-sequence.md (2)

14-21: Use a Python-tagged fence instead of Bash for a Python snippet

The snippet is pure Python but is fenced as bash. Sphinx/Markdown renderers won’t enable Python syntax highlighting and copy-paste helpers.

-```bash
+```python

71-71: Replace “beam searching” with the correct term “beam search”

Minor wording fix for consistency with the rest of the docs.

docs/source/features/disagg-serving.md (1)

150-152: Typo: “serfvers” ⇒ “servers”

Line 151:
`generation serfvers` ⇒ `generation servers`

-when routing requests to the generation serfvers,
+when routing requests to the generation servers,
docs/source/developer-guide/perf-benchmarking.md (1)

313-345: Add language identifiers to fenced code blocks

Several code fences (e.g. dataset-prep and benchmark commands) lack a language hint, triggering MD040 and losing syntax highlighting. Example fix:

-```
+```shell

Apply to all unlabeled fences in this section.

docs/source/features/checkpoint-loading.md (1)

324-327: Typo: “asscoiated” → “associated”.

Small spelling error in the final paragraph; fix to keep the doc professional.

-By setting the model name, the registered mapper will be asscoiated with the specific model.
+By setting the model name, the registered mapper will be associated with the specific model.
docs/source/features/attention.md (1)

246-255: Add language identifier to fenced code block for KV-cache shape

The block beginning with the contiguous KV-cache shape lacks a language tag, triggering markdown-lint (MD040). Specify an identifier (e.g. text) to silence the linter and improve rendering.

-```
+```text
docs/source/features/kvcache.md (2)

3-3: Replace verbose-phrase “in order to” with “to” for conciseness

Four separate sentences use the phrase “in order to …”. Replacing it with “to …” makes the prose tighter without changing meaning.

- … reuse during generation in order to avoid redundant calculations.
+ … reuse during generation to avoid redundant calculations.

- … heads in order to support variable attention …
+ … heads to support variable attention …

- … window size in order to reduce computations …
+ … window size to reduce computations …

- … attention in order to save memory.
+ … attention to save memory.

Also applies to: 7-7, 15-15, 19-19


11-11: Fix possessive pronoun “its KV state”

it's is a contraction of “it is.” Use the possessive pronoun its here.

- When a block is evicted from primary memory, it's KV state is copied …
+ When a block is evicted from primary memory, its KV state is copied …
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33a6c4e and 1857d97.

⛔ Files ignored due to path filters (2)
  • docs/source/features/media/feat_long_seq_chunked_attention.png is excluded by !**/*.png
  • docs/source/features/media/feat_long_seq_sliding_win_attn.png is excluded by !**/*.png
📒 Files selected for processing (21)
  • docs/source/developer-guide/perf-analysis.md (1 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (1 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/attention.md (1 hunks)
  • docs/source/features/checkpoint-loading.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/features/feature-combination-matrix.md (1 hunks)
  • docs/source/features/kvcache.md (1 hunks)
  • docs/source/features/long-sequence.md (1 hunks)
  • docs/source/features/lora.md (1 hunks)
  • docs/source/features/multi-modality.md (1 hunks)
  • docs/source/features/overlap-scheduler.md (1 hunks)
  • docs/source/features/paged-attention-ifb-scheduler.md (1 hunks)
  • docs/source/features/parallel-strategy.md (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/features/sampling.md (1 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/installation/index.rst (1 hunks)
  • docs/source/models/adding-new-model.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/source/features/sampling.md
  • docs/source/features/feature-combination-matrix.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • docs/source/features/overlap-scheduler.md
  • docs/source/examples/dynamo_k8s_example.rst
  • docs/source/features/quantization.md
  • docs/source/features/lora.md
  • docs/source/installation/index.rst
  • docs/source/models/supported-models.md
  • docs/source/features/parallel-strategy.md
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/features/kvcache.md
  • docs/source/features/speculative-decoding.md
  • docs/source/features/long-sequence.md
  • docs/source/features/paged-attention-ifb-scheduler.md
  • docs/source/features/multi-modality.md
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/index.rst
  • docs/source/features/speculative-decoding.md
  • docs/source/models/adding-new-model.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/multi-modality.md
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/index.rst
  • docs/source/features/speculative-decoding.md
  • docs/source/features/long-sequence.md
  • docs/source/features/attention.md
  • docs/source/models/adding-new-model.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/multi-modality.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/index.rst
  • docs/source/features/speculative-decoding.md
  • docs/source/features/long-sequence.md
  • docs/source/models/adding-new-model.md
  • docs/source/features/checkpoint-loading.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/multi-modality.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/features/speculative-decoding.md
  • docs/source/features/disagg-serving.md
🪛 LanguageTool
docs/source/features/kvcache.md

[style] ~3-~3: Consider a more concise word here.
Context: ...value pairs for reuse during generation in order to avoid redundant calculations. The Tenso...

(IN_ORDER_TO_PREMIUM)


[style] ~7-~7: Consider a more concise word here.
Context: ...tention window size and number of heads in order to support variable attention window size ...

(IN_ORDER_TO_PREMIUM)


[style] ~15-~15: Consider a more concise word here.
Context: ...yers with limited attention window size in order to reduce computations and memory usage. B...

(IN_ORDER_TO_PREMIUM)


[style] ~19-~19: Consider a more concise word here.
Context: ...es advantage of grouped query attention in order to save memory. KV cache will create block...

(IN_ORDER_TO_PREMIUM)

docs/source/features/speculative-decoding.md

[style] ~11-~11: Consider removing “of” to be more concise
Context: ...lementation. The table below enumerates all of the algorithm/model combinations that are s...

(ALL_OF_THE)


[style] ~94-~94: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s currently only supported by Deepseek. MTP can be tuned with the following configu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~149-~149: Consider removing “of” to be more concise
Context: ...ts of a speculative decoding algorithm. All of the interfaces are defined in [`_torch/spec...

(ALL_OF_THE)


[style] ~161-~161: Consider removing “of” to be more concise
Context: ... is problematic. Ideally, we would move all of the kernel dispatching logic to a lower lev...

(ALL_OF_THE)


[style] ~187-~187: Consider removing “of” to be more concise
Context: ...et model as a submodule. 7. Drafter: All of the logic required to actually produce draf...

(ALL_OF_THE)

docs/source/features/paged-attention-ifb-scheduler.md

[style] ~123-~123: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...x num token budget of 12, the scheduler is able to schedule two additional requests, Reque...

(BE_ABLE_TO)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...support. For long-context cases this is extremely important, because setting extremely large values...

(EN_WEAK_ADJECTIVE)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...is extremely important, because setting extremely large values of max-num tokens takes away fro...

(EN_WEAK_ADJECTIVE)

docs/source/features/attention.md

[style] ~68-~68: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...osition of each token in each sequence. May be None if positional embedding is appl...

(MISSING_IT_THERE)


[style] ~68-~68: This phrase is redundant. Consider using “outside”.
Context: ...None if positional embedding is applied outside of the backend. | | request_ids | List[int...

(OUTSIDE_OF)


[style] ~131-~131: ‘in proportion to’ might be wordy. Consider a shorter alternative.
Context: ...int is significant (grows quadratically in proportion to the sequence length). The `TrtllmAtten...

(EN_WORDINESS_PREMIUM_IN_PROPORTION_TO)


[style] ~158-~158: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... attention in TensorRT-LLM. That kernel is able to apply pre-processing on the Q, K, and V...

(BE_ABLE_TO)


[style] ~168-~168: To elevate your writing, try using a synonym here.
Context: ...efinition of 'small' in that context is hard to quantify because it depends on the m...

(HARD_TO)


[style] ~254-~254: Consider using a synonym to be more concise.
Context: ...tion of many output tokens, it may take a lot of steps to reach that point). #### Paged...

(A_LOT_OF)


[style] ~312-~312: Consider removing “of” to be more concise
Context: ...implementation of StreamingLLM. It uses all of the tokens for computation and only saves `...

(ALL_OF_THE)

docs/source/features/checkpoint-loading.md

[style] ~146-~146: Consider using a different adverb to strengthen your wording.
Context: ...ent all four components when supporting a completely new checkpoint format - **Custom Weight...

(COMPLETELY_ENTIRELY)


[grammar] ~326-~326: Ensure spelling is correct
Context: ...del name, the registered mapper will be asscoiated with the specific model.

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/developer-guide/perf-analysis.md

[style] ~69-~69: In American English, abbreviations like “etc.” require a period.
Context: ...g, such as GIL, debugging NVTX markers, etc: ```bash #!/bin/bash # Prepare datase...

(ETC_PERIOD)

docs/source/developer-guide/perf-benchmarking.md

[style] ~13-~13: Consider using a more concise synonym.
Context: ...uild tuned engines for benchmarking for a variety of models and platforms. - An entirely Pyt...

(A_VARIETY_OF)


[style] ~147-~147: Consider a more concise word here.
Context: ...plete JSON entry is on every line. ``` In order to prepare a synthetic dataset, you can us...

(IN_ORDER_TO_PREMIUM)


[style] ~325-~325: ‘due to the fact that’ might be wordy. Consider a shorter alternative.
Context: ...re texts and not tokenized ids. This is due to the fact that the prompt and the media (image/video...

(EN_WORDINESS_PREMIUM_DUE_TO_THE_FACT_THAT)

docs/source/features/disagg-serving.md

[grammar] ~151-~151: Ensure spelling is correct
Context: ...when routing requests to the generation serfvers, the disaggregated server will mark the...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~202-~202: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. * `TRTLLM_KVCACHE_SEND_MAX_CONCURRE...

(MORE_THAN_EXCEEDS)


[style] ~204-~204: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. There are some other useful environ...

(MORE_THAN_EXCEEDS)


[style] ~208-~208: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...

(MAKE_SURE_ENSURE)

docs/source/features/multi-modality.md

[style] ~3-~3: Consider using a more concise synonym.
Context: ... in TensorRT-LLM TensorRT-LLM supports a variety of multimodal models, enabling efficient i...

(A_VARIETY_OF)

🪛 markdownlint-cli2 (0.17.2)
docs/source/features/speculative-decoding.md

33-33: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


34-34: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


55-55: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


56-56: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


77-77: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


79-79: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


80-80: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


81-81: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


96-96: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


97-97: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


98-98: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


99-99: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


115-115: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


116-116: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


117-117: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


132-132: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


133-133: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


134-134: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


135-135: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


139-139: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


194-194: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


195-195: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


198-198: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


199-199: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

docs/source/features/attention.md

159-159: Link fragments should be valid

(MD051, link-fragments)


234-234: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


273-273: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


293-293: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


313-313: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


329-329: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


337-337: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


349-349: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/source/developer-guide/perf-analysis.md

13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


15-15: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


22-22: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


23-23: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

docs/source/developer-guide/perf-benchmarking.md

159-159: Link fragments should be valid

(MD051, link-fragments)


234-234: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


273-273: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


293-293: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


313-313: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


329-329: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


337-337: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


349-349: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/features/long-sequence.md (1)

38-40: “llama4 model” is misleading / unversioned

llama4 is not an official HF identifier and might confuse users. Please specify the exact model name (e.g. meta-llama/Llama-4-7b-hf) or qualify that support is limited to “Llama 4 family models”.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/source/index.rst (1)

61-70: Broken toctree entry – filename mismatch remains unresolved

The toctree still points to features/paged-attention-ifb-scheduling.md, but the actual file added in this PR is features/paged-attention-ifb-scheduler.md.
Sphinx will fail to build the docs with a “file not found” error.

-   features/paged-attention-ifb-scheduling.md
+   features/paged-attention-ifb-scheduler.md
🧹 Nitpick comments (2)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (2)

246-252: Add a language identifier to fenced code block

Markdown-lint (MD040) flags this block; the opening fence lacks a language hint:

-```
+```text

Using text (or none) keeps the table un-highlighted while silencing the linter.


324-333: Second unlabeled code fence – specify language

Same MD040 issue as above. Prefix the fence with text (or none) to avoid warnings and keep CI green.

-```
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1857d97 and 0930b65.

📒 Files selected for processing (6)
  • docs/source/commands/trtllm-eval.rst (1 hunks)
  • docs/source/deployment-guide/index.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (3 hunks)
  • docs/source/index.rst (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/deployment-guide/index.rst
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/commands/trtllm-eval.rst
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/index.rst
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/index.rst
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/index.rst
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

[style] ~115-~115: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~127-~127: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~142-~142: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~146-~146: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~156-~156: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~160-~160: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~172-~172: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~178-~178: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [TorchLlmA...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~303-~303: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

260-260: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


346-346: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc branch from 0930b65 to c5ef99e Compare August 8, 2025 06:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

324-350: Same fenced-block language issue as above for the sample benchmark output.

Use text (or none) so automated docs checks pass and the sample output is not mis-highlighted.

🧹 Nitpick comments (5)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (5)

15-28: Prerequisite / GPU support wording is contradictory.

Lines 15-19 state Hopper or Blackwell GPUs are supported, yet line 27 says NVFP4 is Blackwell-only. Consider splitting the prerequisites by quantization mode or explicitly noting “FP8: Hopper/Blackwell, NVFP4: Blackwell only” to avoid confusion.


33-44: Add --pull=always (or similar) to ensure users get the intended image.

Including an explicit pull policy in the docker command guards against stale local images and avoids subtle mismatches with the documented features.


67-69: Call out dtype linkage to model precision.

The sample YAML hard-codes dtype: fp8; remind readers that this must match the loaded checkpoint (e.g., change to fp4 for NVFP4) to prevent silent fallback to higher precision.


238-268: Specify a language for fenced code blocks to satisfy markdownlint (MD040).

Add shell to the opening back-ticks for command snippets and text or none for console output blocks:

-```
+```shell

255-255: Minor typo – “LLama” → “Llama”.

Correct casing for consistency with the rest of the document.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0930b65 and c5ef99e.

📒 Files selected for processing (11)
  • docs/source/commands/trtllm-eval.rst (1 hunks)
  • docs/source/conf.py (1 hunks)
  • docs/source/deployment-guide/index.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (3 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
  • docs/source/overview.md (2 hunks)
  • docs/source/quick-start-guide.md (3 hunks)
  • docs/source/reference/support-matrix.md (0 hunks)
💤 Files with no reviewable changes (1)
  • docs/source/reference/support-matrix.md
✅ Files skipped from review due to trivial changes (5)
  • docs/source/overview.md
  • docs/source/deployment-guide/index.rst
  • docs/source/quick-start-guide.md
  • docs/source/commands/trtllm-eval.rst
  • docs/source/conf.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/models/supported-models.md
  • docs/source/index.rst
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

[style] ~115-~115: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~127-~127: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~142-~142: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~146-~146: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~156-~156: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~160-~160: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~172-~172: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~178-~178: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [TorchLlmA...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~303-~303: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

238-238: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


246-246: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


324-324: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

1-1: Verify model naming (“Llama 3 70B” vs “Llama3.3 70B”).

The official Meta release is branded “Llama 3 70B”. Please confirm whether “3.3” is intentional or a typo and update the file name, title, and model strings accordingly.

@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc branch 2 times, most recently from f0e5049 to 801e81e Compare August 8, 2025 07:37
@nv-guomingz nv-guomingz requested a review from hlu1 August 8, 2025 07:38
| DeepseekV3ForCausalLM | Yes | Yes | Yes | Yes | Yes [^1] | Yes | No | No | Yes | Yes | Yes [^2] | N/A | Yes | Yes |
| Qwen3MoeForCausalLM | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
| Llama4ForConditionalGeneration | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Untested | N/A | Yes | Yes |
| GPT-OSS | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Untested | N/A | Yes | Yes |
Copy link
Collaborator Author

@hlu1 @Tracin please help confirm the gpt-oss feature matrix.
cc @laikhtewari @juney-nvidia

@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc branch from 801e81e to 1ff7743 Compare August 8, 2025 08:29
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (11)
docs/source/features/multi-modality.md (1)

25-27: Relative URL will break in Sphinx – use :doc: cross-reference

docs/source/developer-guide/perf-analysis.md (1)

81-88: Escaped back-slash makes the example command invalid

Writing `trtllm-bench \ # or trtllm-serve` concatenates the comment into the command. Put the comment on a separate line or drop the back-slash:

# Replace with `trtllm-serve` if desired
trtllm-bench \
docs/source/index.rst (2)

69-69: Incorrect filename – Sphinx will fail

Change features/paged-attention-ifb-scheduling.mdfeatures/paged-attention-ifb-scheduler.md.


153-158: External URLs cannot live inside a toctree

Move the releases/GitHub links outside the .. toctree:: block and render them as plain reST links or a bullet list; otherwise the build errors.

docs/source/features/paged-attention-ifb-scheduler.md (1)

76-79: Fix broken cross-link to performance best practices (path moved).

The relative link points to a removed directory. Update it to the new developer-guide location.

-Refer to
-[the performance best practices](../performance/perf-best-practices.md#chunked-context) for usage.
+Refer to
+[the performance best practices](../developer-guide/perf-analysis.md#chunked-context) for usage.
docs/source/features/sampling.md (1)

78-78: Typo breaks guided-decoding API usage: strucutral_tagstructural_tag.

This will cause “unexpected keyword argument” errors when copied.

-    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `strucutral_tag`).
+    * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
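
For anyone copy-pasting from that section, a minimal sketch of the corrected usage, assuming the object being constructed is `GuidedDecodingParams` from `tensorrt_llm.sampling_params` (the schema below is purely illustrative):

```python
from tensorrt_llm.sampling_params import GuidedDecodingParams, SamplingParams

# Choose exactly one format argument: json, regex, grammar, or structural_tag.
guided = GuidedDecodingParams(
    json={"type": "object", "properties": {"answer": {"type": "string"}}})
sampling_params = SamplingParams(guided_decoding=guided)
```
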
docs/source/developer-guide/perf-benchmarking.md (2)

17-19: Add the missing link target for “[in-flight batching]”.

The bracketed text lacks a destination. Link to the canonical section in the new features doc.

-`trtllm-bench` executes all benchmarks using [in-flight batching] -- for more information see
-the [in-flight batching section](../advanced/gpt-attention.md#in-flight-batching) that describes the concept
+`trtllm-bench` executes all benchmarks using [in-flight batching](../features/paged-attention-ifb-scheduler.md#in-flight-batching) -- for more information see
+the [in-flight batching section](../features/paged-attention-ifb-scheduler.md#in-flight-batching) that describes the concept
 in further detail.

113-119: Correct field description: input_ids are token IDs, not logits.

This misleads dataset authors and downstream users.

-| `input_ids`     |    Y*    | List[Integer] | List of logits that make up the request prompt. |
+| `input_ids`     |    Y*    | List[Integer] | Token IDs that make up the request prompt.      |
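
To make the corrected semantics concrete, a small sketch (only the `input_ids` field from the table is shown; the prompt text and tokenizer name are illustrative):

```python
from transformers import AutoTokenizer

# input_ids holds token IDs produced by a tokenizer, not logits.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
entry = {"input_ids": tokenizer.encode("Summarize the following paragraph:")}
```
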
docs/source/models/adding-new-model.md (1)

175-182: Expose all public symbols in `__all__` (incomplete export list).

Wildcard imports will miss MyModel and MyConfig. Include all public symbols for the new model to keep examples working.

 from .modeling_mymodel import MyModelForCausalLM
+from .modeling_mymodel import MyModel
+from .configuration_mymodel import MyConfig

 __all__ = [
     ...,
-    "MyModelForCausalLM",
+    "MyModelForCausalLM",
+    "MyModel",
+    "MyConfig",
 ]
docs/source/features/speculative-decoding.md (1)

84-84: Fix import path: use tensorrt_llm.llmapi (not tensorrt_llm.llm_api).

The package exposes llmapi; llm_api will fail at import.

-from tensorrt_llm.llm_api import NGramDecodingConfig
+from tensorrt_llm.llmapi import NGramDecodingConfig
-from tensorrt_llm.llm_api import MTPDecodingConfig
+from tensorrt_llm.llmapi import MTPDecodingConfig
-from tensorrt_llm.llm_api import UserProvidedDecodingConfig
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig

Also applies to: 105-105, 120-120

docs/source/features/attention.md (1)

262-264: Fix invalid Sphinx link (custom scheme breaks build).

Use a resolvable cross-reference.

-See the implementation of
-[`KVCacheManager`](source:tensorrt_llm/_torch/pyexecutor/resource_manager.py).
+See :py:class:`tensorrt_llm._torch.pyexecutor.resource_manager.KVCacheManager` for implementation details.
🧹 Nitpick comments (33)
docs/source/features/overlap-scheduler.md (1)

23-24: Convert bare URL to explicit Markdown link

The plain URL trips MD034. Use [SG-Lang blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#zero-overhead-batch-scheduler) to satisfy markdown-lint and improve readability.

docs/source/features/kvcache.md (1)

3-20: Consider tightening phrasing

The repeated “in order to” makes sentences wordy (four occurrences). Re-phrase (“to reduce”, “to support”, …) for concision.

docs/source/developer-guide/perf-analysis.md (1)

12-24: Fix list indentation to satisfy MD007

Remove the leading two spaces before each bullet so they align with column 0.

docs/source/features/paged-attention-ifb-scheduler.md (5)

20-22: Close the italicized footnote.

The opening underscore for the footnote isn’t closed, which can break formatting downstream.

-_(1) Padding sequences in the generation phase that contain a single token to
-the length of the maximum input sequence is inefficient use of resources.
+_(1) Padding sequences in the generation phase that contain a single token to
+the length of the maximum input sequence is inefficient use of resources._

49-49: Typo: “re-buliding” → “rebuilding”.

-You can tune it using the runtime `max_num_tokens` without re-buliding the engine.
+You can tune it using the runtime `max_num_tokens` without rebuilding the engine.

10-11: Consistency: use “in-flight” consistently.

Standardize “inflight” → “in-flight” for consistency with the rest of the doc.

-For efficiency reasons (1), the support for inflight batching ***requires the
+For efficiency reasons (1), the support for in-flight batching ***requires the

93-95: Add language to fenced code block (markdownlint MD040).

-```
+```text
 [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].
-```
+```

76-78: Terminology: KV cache capitalization.

Use “KV cache” consistently (not “kv-cache”).

-... the FMHA paged kv-cache also
+... the FMHA paged KV cache also
docs/source/features/sampling.md (3)

9-9: Use absolute links for code references to avoid breakage in the built docs site.

Relative repo paths with line anchors work on GitHub but typically break in the published Sphinx site. Prefer absolute GitHub URLs (optionally without line numbers to reduce staleness).

Example fix:

-2. Pass a [`SamplingParams`](../../../../tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+2. Pass a [`SamplingParams`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/sampling_params.py) object with the desired options to the `generate()` function

Would you like me to sweep and update all similar links in this file to absolute GitHub URLs?

Also applies to: 77-77, 105-105


128-129: Ensure logits processor example works with TRT-LLM sampler.

If logits processors are consumed by the TRT-LLM sampler path, align with the earlier guidance by enabling it on LLM.

-llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
+llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8', enable_trtllm_sampler=True)

If logits processors are supported without the TRT sampler, ignore this and consider adding a note to clarify when enable_trtllm_sampler is required.


78-78: Normalize list marker to satisfy markdownlint (MD004).

-    * Note: Depending on the type of format, ...
+    - Note: Depending on the type of format, ...
docs/source/features/disagg-serving.md (6)

25-25: Add alt text to images (markdownlint MD045).

Provide concise alt text for accessibility.

-  <img src=".../tech_blog5_Picture1.png" width="640" height="auto">
+  <img src=".../tech_blog5_Picture1.png" alt="Aggregated LLM serving timeline" width="640" height="auto">

Apply similarly to Figures 2–7.

Also applies to: 34-34, 51-51, 62-62, 73-73, 88-88, 175-175


110-125: Add language identifiers to code fences (markdownlint MD040).

-```
+```bash
 # Generate context_extra-llm-api-config.yml
 # Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
 echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n  backend: UCX\n  max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
...
 CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --backend pytorch --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
-```
+```

130-132: Add language to code fence (markdownlint MD040).

-```
+```bash
 trtllm-serve disaggregated -c disagg_config.yaml
-```
+```

135-148: YAML fence for config snippet (markdownlint MD040).

-```
+```yaml
 hostname: localhost
 port: 8000
 backend: pytorch
 context_servers:
   num_instances: 2
   urls:
       - "localhost:8001"
       - "localhost:8002"
 generation_servers:
   num_instances: 1
   urls:
       - "localhost:8003"
-```
+```

151-152: Typo: “serfvers” → “servers”.

-when routing requests to the generation serfvers, the disaggregated server will mark the requests as "generation-only" ...
+when routing requests to the generation servers, the disaggregated server will mark the requests as "generation-only" ...

104-107: Consistency: “KV cache” capitalization.

Standardize “kvCache” → “KV cache”.

-`backend` specifies the communication backend for transferring the kvCache, ...
-`max_tokens_in_buffer` defines the buffer size for kvCache transfers, ...
+`backend` specifies the communication backend for transferring the KV cache, ...
+`max_tokens_in_buffer` defines the buffer size for KV cache transfers, ...
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (3)

27-27: Clarify hardware limitation phrasing.

Slightly improve clarity.

-Note that NVFP4 is only supported on NVIDIA Blackwell
+Note: NVFP4 is only supported on NVIDIA Blackwell.

53-54: Grammar polish.

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to ...
+To use the latest main branch, build TensorRT-LLM from source by following:
+...

255-255: Typo: “LLama” → “Llama”.

-* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instruction model. So set add\_special\_tokens=False to avoid it.
+* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for Llama 3.3 70B instruction model. Set add\_special\_tokens=False to avoid it.
docs/source/developer-guide/perf-benchmarking.md (2)

313-324: Add language identifiers to fenced blocks (markdownlint MD040).

-```
+```bash
 python ./benchmarks/cpp/prepare_dataset.py \
   --tokenizer Qwen/Qwen2-VL-2B-Instruct \
   --stdout \
   dataset \
   --dataset-name lmms-lab/MMMU \
   --dataset-split test \
   --dataset-image-key image \
   --dataset-prompt-key question \
   --num-requests 10 \
   --output-len-dist 128,5 > mm_data.jsonl
-```
+```

459-459: Typo: “quantizaton” → “quantization”.

-If you would like to force the KV cache quantizaton, you can specify the following in the YAML file to force the precision
+If you would like to force the KV cache quantization, you can specify the following in the YAML file to force the precision
docs/source/models/adding-new-model.md (1)

27-41: Unify brand/style: “Hugging Face” (avoid “Huggingface/HuggingFace’s”).

Consistency improves professionalism and searchability.

Examples:

  • Replace “HuggingFace's transformers” with “Hugging Face Transformers”.
  • Replace “Huggingface's LLaMA” with “Hugging Face’s Llama”.

Also applies to: 134-146

docs/source/features/speculative-decoding.md (5)

33-35: Fix markdownlint MD004: use dash “-” for unordered lists.

Standardize list markers across the doc.

-* KV cache reuse must be disabled (this occurs implicitly).
-* Overlap scheduling must be disabled.
+- KV cache reuse must be disabled (this occurs implicitly).
+- Overlap scheduling must be disabled.
-* Llama 3 variants: [use the checkpoints from the authors of the original EAGLE 3 paper](https://huggingface.co/yuhuili).
-* Llama 4 Maverick: [use the checkpoint from the NVIDIA HuggingFace repository](https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3).
+- Llama 3 variants: [use the checkpoints from the authors of the original EAGLE 3 paper](https://huggingface.co/yuhuili).
+- Llama 4 Maverick: [use the checkpoint from the NVIDIA HuggingFace repository](https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3).
-* `max_draft_len`: Maximum draft candidate length.
-* `max_matching_ngram_size`: Maximum prompt suffix length to match with keys in the pool.
-* `is_public_pool`: If true, a single ngram pool is shared for all requests. Otherwise, each request has its own ngram pool.
-* `is_keep_all`: If true, draft candidates will be retained in the pool forever. Otherwise, only the largest draft candidate is retained.
-* `is_use_oldest`: If true, the oldest draft candidate is always proposed for a given match. Otherwise, the newest draft candidate is used. Only applicable if `is_keep_all == True` because `is_keep_all == False` means we'll only ever have a single value for each key.
+- `max_draft_len`: Maximum draft candidate length.
+- `max_matching_ngram_size`: Maximum prompt suffix length to match with keys in the pool.
+- `is_public_pool`: If true, a single ngram pool is shared for all requests. Otherwise, each request has its own ngram pool.
+- `is_keep_all`: If true, draft candidates will be retained in the pool forever. Otherwise, only the largest draft candidate is retained.
+- `is_use_oldest`: If true, the oldest draft candidate is always proposed for a given match. Otherwise, the newest draft candidate is used. Only applicable if `is_keep_all == True` because `is_keep_all == False` means we'll only ever have a single value for each key.
-* `max_draft_len`: Maximum draft candidate length.
-* `num_nextn_predict_layers`: Number of MTP modules to use. Currently must match `max_draft_len`.
-* `use_relaxed_acceptance_for_thinking`: If true, use relaxed decoding for reasoning models in the thinking phase. In this mode, speculation requirements are relaxed for the thinking phase - a draft token may be accepted if it appears in a candidate set constructed with `relaxed_topk` and `relaxed_delta`.
-* `relaxed_topk`: The top K tokens are sampled from the target model's logits to create the initial candidate set for relaxed decoding.
-* `relaxed_delta`: Used to further filter the top K candidate set for relaxed decoding. We remove tokens `t` for which `log(P(top 1 token)) - log(P(t)) > relaxed_delta`.
+- `max_draft_len`: Maximum draft candidate length.
+- `num_nextn_predict_layers`: Number of MTP modules to use. Currently must match `max_draft_len`.
+- `use_relaxed_acceptance_for_thinking`: If true, use relaxed decoding for reasoning models in the thinking phase. In this mode, speculation requirements are relaxed for the thinking phase - a draft token may be accepted if it appears in a candidate set constructed with `relaxed_topk` and `relaxed_delta`.
+- `relaxed_topk`: The top K tokens are sampled from the target model's logits to create the initial candidate set for relaxed decoding.
+- `relaxed_delta`: Used to further filter the top K candidate set for relaxed decoding. We remove tokens `t` for which `log(P(top 1 token)) - log(P(t)) > relaxed_delta`.
-* `max_draft_len`: Maximum draft candidate length.
-* `drafter`: An object of type `Drafter` that implements the `prepare_draft_tokens` method (see [Developer Guide](speculative-decoding.md#developer-guide) 7.)
-* `resource_manager`: An optional `ResourceManager` object (see [Developer Guide](speculative-decoding.md#developer-guide) 4.)
+- `max_draft_len`: Maximum draft candidate length.
+- `drafter`: An object of type `Drafter` that implements the `prepare_draft_tokens` method (see [Developer Guide](speculative-decoding.md#developer-guide) 7.)
+- `resource_manager`: An optional `ResourceManager` object (see [Developer Guide](speculative-decoding.md#developer-guide) 4.)
-* `MTP`
-* `Eagle` (for EAGLE 3)
-* `NGram`
-* `DraftTarget`
+- `MTP`
+- `Eagle` (for EAGLE 3)
+- `NGram`
+- `DraftTarget`
-* KV cache reuse must be disabled.
-* Overlap scheduling must be disabled.
+- KV cache reuse must be disabled.
+- Overlap scheduling must be disabled.
-* `_prepare_draft_requests`
-* `_prepare_draft_tokens`
+- `_prepare_draft_requests`
+- `_prepare_draft_tokens`

Also applies to: 55-56, 77-81, 96-101, 115-118, 132-136, 194-199


139-145: Fix markdownlint MD040: specify language for fenced code block.

Set YAML language for the example.

-```
+```yaml
 disable_overlap_scheduler: true
 speculative_config:
   decoding_type: Eagle
   max_draft_len: 4
   speculative_model: /path/to/draft/model

237-238: Typo in example (“drat” → “draft”).

-`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are drat tokens. Suppose the token after `t` is `d1`
+`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are draft tokens. Suppose the token after `t` is `d1`

61-69: Clarify overlap scheduler guidance for EAGLE 3.

Text says disabling overlap scheduler is only needed when eagle3_one_model is False, but the snippet always passes disable_overlap_scheduler=True. Make the example consistent.

-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Disable overlap scheduler only for the two-model variant.
+llm = LLM(
+    "/path/to/target_model",
+    speculative_config=speculative_config,
+    disable_overlap_scheduler=(not eagle3_one_model),
+)

41-47: Optional: add a one-line note before first snippet listing common imports.

Reduces duplication and keeps snippets concise.

Also applies to: 84-90, 105-111, 120-126
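
For instance, the note could point to a shared import block along these lines (a sketch; it reuses only the class names already referenced in this review, under the corrected `tensorrt_llm.llmapi` path):

```python
# Common imports shared by the speculative-decoding snippets.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import (MTPDecodingConfig, NGramDecodingConfig,
                                 UserProvidedDecodingConfig)
```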

docs/source/features/attention.md (3)

68-69: Grammar: “outside the backend” (remove “of”).

-| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside of the backend. |
+| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside the backend. |

131-133: Concise phrasing for complexity note.

-is the slowest method and the memory footprint is significant (grows quadratically in proportion to the sequence length).
+is the slowest method and the memory footprint is significant (grows quadratically with sequence length).

158-161: Concise phrasing: “can apply” instead of “is able to apply”.

-That kernel is able to apply
+That kernel can apply
docs/source/features/checkpoint-loading.md (2)

326-326: Typo: “asscoiated” → “associated”.

-By setting the model name, the registered mapper will be asscoiated with the specific model.
+By setting the model name, the registered mapper will be associated with the specific model.

92-94: Style: capitalize brand/API (“LLM API”).

-The first approach, through llm-api, as shown in the following example:
+The first approach, through the LLM API, as shown in the following example:
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0e5049 and 1ff7743.

⛔ Files ignored due to path filters (2)
  • docs/source/features/media/feat_long_seq_chunked_attention.png is excluded by !**/*.png
  • docs/source/features/media/feat_long_seq_sliding_win_attn.png is excluded by !**/*.png
📒 Files selected for processing (30)
  • docs/source/commands/trtllm-eval.rst (1 hunks)
  • docs/source/conf.py (1 hunks)
  • docs/source/deployment-guide/index.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (3 hunks)
  • docs/source/developer-guide/perf-analysis.md (1 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (1 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/attention.md (1 hunks)
  • docs/source/features/checkpoint-loading.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/features/feature-combination-matrix.md (1 hunks)
  • docs/source/features/kvcache.md (1 hunks)
  • docs/source/features/long-sequence.md (1 hunks)
  • docs/source/features/lora.md (1 hunks)
  • docs/source/features/multi-modality.md (1 hunks)
  • docs/source/features/overlap-scheduler.md (1 hunks)
  • docs/source/features/paged-attention-ifb-scheduler.md (1 hunks)
  • docs/source/features/parallel-strategy.md (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/features/sampling.md (1 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/installation/index.rst (1 hunks)
  • docs/source/models/adding-new-model.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
  • docs/source/overview.md (2 hunks)
  • docs/source/quick-start-guide.md (3 hunks)
  • docs/source/reference/support-matrix.md (1 hunks)
✅ Files skipped from review due to trivial changes (6)
  • docs/source/deployment-guide/index.rst
  • docs/source/features/feature-combination-matrix.md
  • docs/source/features/quantization.md
  • docs/source/examples/dynamo_k8s_example.rst
  • docs/source/features/lora.md
  • docs/source/commands/trtllm-eval.rst
🚧 Files skipped from review as they are similar to previous changes (10)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/conf.py
  • docs/source/installation/index.rst
  • docs/source/overview.md
  • docs/source/reference/support-matrix.md
  • docs/source/models/supported-models.md
  • docs/source/quick-start-guide.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/features/long-sequence.md
  • docs/source/features/parallel-strategy.md
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/features/paged-attention-ifb-scheduler.md
  • docs/source/features/kvcache.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • docs/source/features/paged-attention-ifb-scheduler.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/index.rst
  • docs/source/models/adding-new-model.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/index.rst
  • docs/source/models/adding-new-model.md
  • docs/source/features/attention.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/multi-modality.md
  • docs/source/features/speculative-decoding.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/index.rst
  • docs/source/models/adding-new-model.md
  • docs/source/features/checkpoint-loading.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
  • docs/source/features/speculative-decoding.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/features/disagg-serving.md
  • docs/source/features/speculative-decoding.md
🪛 markdownlint-cli2 (0.17.2)
docs/source/features/paged-attention-ifb-scheduler.md

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


141-141: Link fragments should be valid

(MD051, link-fragments)

docs/source/developer-guide/perf-analysis.md

13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


15-15: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


22-22: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


23-23: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

docs/source/features/disagg-serving.md

3-3: Link fragments should be valid
Expected: #motivation; Actual: #Motivation

(MD051, link-fragments)


4-4: Link fragments should be valid
Expected: #kv-cache-exchange; Actual: #KV-Cache-Exchange

(MD051, link-fragments)


5-5: Link fragments should be valid
Expected: #multi-backend-support; Actual: #Multi-backend-Support

(MD051, link-fragments)


6-6: Link fragments should be valid
Expected: #overlap-optimization; Actual: #Overlap-Optimization

(MD051, link-fragments)


7-7: Link fragments should be valid
Expected: #cache-layout-transformation; Actual: #Cache-Layout-Transformation

(MD051, link-fragments)


8-8: Link fragments should be valid
Expected: #usage; Actual: #Usage

(MD051, link-fragments)


10-10: Link fragments should be valid
Expected: #dynamo; Actual: #Dynamo

(MD051, link-fragments)


11-11: Link fragments should be valid
Expected: #environment-variables; Actual: #Environment-Variables

(MD051, link-fragments)


12-12: Link fragments should be valid
Expected: #troubleshooting-and-faq; Actual: #Troubleshooting-and-FAQ

(MD051, link-fragments)


20-20: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


21-21: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


25-25: Images should have alternate text (alt text)

(MD045, no-alt-text)


34-34: Images should have alternate text (alt text)

(MD045, no-alt-text)


51-51: Images should have alternate text (alt text)

(MD045, no-alt-text)


62-62: Images should have alternate text (alt text)

(MD045, no-alt-text)


73-73: Images should have alternate text (alt text)

(MD045, no-alt-text)


78-78: Link fragments should be valid
Expected: #environment-variables; Actual: #Environment-Variables

(MD051, link-fragments)


88-88: Images should have alternate text (alt text)

(MD045, no-alt-text)


110-110: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


130-130: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


135-135: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


175-175: Images should have alternate text (alt text)

(MD045, no-alt-text)


190-190: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


192-192: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


194-194: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


196-196: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


198-198: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


200-200: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


202-202: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


204-204: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


208-208: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


256-256: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


258-258: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


260-260: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

docs/source/features/overlap-scheduler.md

24-24: Bare URL used

(MD034, no-bare-urls)

docs/source/features/sampling.md

78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

docs/source/features/speculative-decoding.md

33-33: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


34-34: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


55-55: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


56-56: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


77-77: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


79-79: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


80-80: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


81-81: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


96-96: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


97-97: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


98-98: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


99-99: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


100-100: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


115-115: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


116-116: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


117-117: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


132-132: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


133-133: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


134-134: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


135-135: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


139-139: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


194-194: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


195-195: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


198-198: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


199-199: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

🪛 LanguageTool
docs/source/features/paged-attention-ifb-scheduler.md

[style] ~123-~123: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...x num token budget of 12, the scheduler is able to schedule two additional requests, Reque...

(BE_ABLE_TO)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...support. For long-context cases this is extremely important, because setting extremely large values...

(EN_WEAK_ADJECTIVE)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...is extremely important, because setting extremely large values of max-num tokens takes away fro...

(EN_WEAK_ADJECTIVE)

docs/source/features/checkpoint-loading.md

[style] ~146-~146: Consider using a different adverb to strengthen your wording.
Context: ...ent all four components when supporting a completely new checkpoint format - **Custom Weight...

(COMPLETELY_ENTIRELY)

docs/source/features/attention.md

[style] ~68-~68: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...osition of each token in each sequence. May be None if positional embedding is appl...

(MISSING_IT_THERE)


[style] ~68-~68: This phrase is redundant. Consider using “outside”.
Context: ...None if positional embedding is applied outside of the backend. | | request_ids | List[int...

(OUTSIDE_OF)


[style] ~131-~131: ‘in proportion to’ might be wordy. Consider a shorter alternative.
Context: ...int is significant (grows quadratically in proportion to the sequence length). The `TrtllmAtten...

(EN_WORDINESS_PREMIUM_IN_PROPORTION_TO)


[style] ~158-~158: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... attention in TensorRT-LLM. That kernel is able to apply pre-processing on the Q, K, and V...

(BE_ABLE_TO)


[style] ~168-~168: To elevate your writing, try using a synonym here.
Context: ...efinition of 'small' in that context is hard to quantify because it depends on the m...

(HARD_TO)


[style] ~254-~254: Consider using a synonym to be more concise.
Context: ...tion of many output tokens, it may take a lot of steps to reach that point). #### Paged...

(A_LOT_OF)


[style] ~312-~312: Consider removing “of” to be more concise
Context: ...implementation of StreamingLLM. It uses all of the tokens for computation and only saves `...

(ALL_OF_THE)

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

[style] ~115-~115: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~127-~127: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~142-~142: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~146-~146: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~156-~156: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~160-~160: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~172-~172: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~178-~178: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [TorchLlmA...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~303-~303: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

docs/source/developer-guide/perf-analysis.md

[style] ~69-~69: In American English, abbreviations like “etc.” require a period.
Context: ...g, such as GIL, debugging NVTX markers, etc: ```bash #!/bin/bash # Prepare datase...

(ETC_PERIOD)

docs/source/developer-guide/perf-benchmarking.md

[style] ~13-~13: Consider using a more concise synonym.
Context: ...uild tuned engines for benchmarking for a variety of models and platforms. - An entirely Pyt...

(A_VARIETY_OF)


[style] ~147-~147: Consider a more concise word here.
Context: ...plete JSON entry is on every line. ``` In order to prepare a synthetic dataset, you can us...

(IN_ORDER_TO_PREMIUM)


[style] ~325-~325: ‘due to the fact that’ might be wordy. Consider a shorter alternative.
Context: ...re texts and not tokenized ids. This is due to the fact that the prompt and the media (image/video...

(EN_WORDINESS_PREMIUM_DUE_TO_THE_FACT_THAT)


[grammar] ~459-~459: Ensure spelling is correct
Context: ...If you would like to force the KV cache quantizaton, you can specify the following in the Y...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/features/disagg-serving.md

[style] ~202-~202: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. * `TRTLLM_KVCACHE_SEND_MAX_CONCURRE...

(MORE_THAN_EXCEEDS)


[style] ~204-~204: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. There are some other useful environ...

(MORE_THAN_EXCEEDS)


[style] ~208-~208: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...

(MAKE_SURE_ENSURE)

docs/source/features/kvcache.md

[style] ~3-~3: Consider a more concise word here.
Context: ...value pairs for reuse during generation in order to avoid redundant calculations. The Tenso...

(IN_ORDER_TO_PREMIUM)


[style] ~7-~7: Consider a more concise word here.
Context: ...tention window size and number of heads in order to support variable attention window size ...

(IN_ORDER_TO_PREMIUM)


[style] ~15-~15: Consider a more concise word here.
Context: ...yers with limited attention window size in order to reduce computations and memory usage. B...

(IN_ORDER_TO_PREMIUM)


[style] ~19-~19: Consider a more concise word here.
Context: ...es advantage of grouped query attention in order to save memory. KV cache will create block...

(IN_ORDER_TO_PREMIUM)

docs/source/features/multi-modality.md

[style] ~3-~3: Consider using a more concise synonym.
Context: ... in TensorRT-LLM TensorRT-LLM supports a variety of multimodal models, enabling efficient i...

(A_VARIETY_OF)

docs/source/features/speculative-decoding.md

[style] ~11-~11: Consider removing “of” to be more concise
Context: ...lementation. The table below enumerates all of the algorithm/model combinations that are s...

(ALL_OF_THE)


[style] ~94-~94: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s currently only supported by Deepseek. MTP can be tuned with the following configu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~149-~149: Consider removing “of” to be more concise
Context: ...ts of a speculative decoding algorithm. All of the interfaces are defined in [`_torch/spec...

(ALL_OF_THE)


[style] ~161-~161: Consider removing “of” to be more concise
Context: ... is problematic. Ideally, we would move all of the kernel dispatching logic to a lower lev...

(ALL_OF_THE)


[style] ~187-~187: Consider removing “of” to be more concise
Context: ...et model as a submodule. 7. Drafter: All of the logic required to actually produce draf...

(ALL_OF_THE)

Comment on lines +130 to +139
```python
from tensorrt_llm._torch.models.checkpoints.hf.gemma3_weight_mapper import \
Gemma3HfWeightMapper
from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM

gemma3 = Gemma3ForCausalLM(model_config)
weight_mapper = Gemma3HfWeightMapper()
weight_mapper.init_model_and_config(gemma3, model_config)
gemma3.load_weights(hf_gemma3.state_dict(), weight_mapper)
```
Copy link
Contributor

⚠️ Potential issue

Undefined variable in example (hf_gemma3).

Replace with a defined HF model object, or clarify it’s the source model.

-from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM
+from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM
+from transformers import AutoModelForCausalLM
@@
-gemma3 = Gemma3ForCausalLM(model_config)
-weight_mapper = Gemma3HfWeightMapper()
-weight_mapper.init_model_and_config(gemma3, model_config)
-gemma3.load_weights(hf_gemma3.state_dict(), weight_mapper)
+gemma3 = Gemma3ForCausalLM(model_config)
+weight_mapper = Gemma3HfWeightMapper()
+weight_mapper.init_model_and_config(gemma3, model_config)
+# Load the HF source model to obtain a state dict:
+hf_model = AutoModelForCausalLM.from_pretrained(hf_model_dir, torch_dtype="auto")
+gemma3.load_weights(hf_model.state_dict(), weight_mapper)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
from tensorrt_llm._torch.models.checkpoints.hf.gemma3_weight_mapper import \
Gemma3HfWeightMapper
from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM

gemma3 = Gemma3ForCausalLM(model_config)
weight_mapper = Gemma3HfWeightMapper()
weight_mapper.init_model_and_config(gemma3, model_config)
gemma3.load_weights(hf_gemma3.state_dict(), weight_mapper)
```

```python
from tensorrt_llm._torch.models.checkpoints.hf.gemma3_weight_mapper import \
Gemma3HfWeightMapper
from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM
from transformers import AutoModelForCausalLM

gemma3 = Gemma3ForCausalLM(model_config)
weight_mapper = Gemma3HfWeightMapper()
weight_mapper.init_model_and_config(gemma3, model_config)
# Load the HF source model to obtain a state dict:
hf_model = AutoModelForCausalLM.from_pretrained(hf_model_dir, torch_dtype="auto")
gemma3.load_weights(hf_model.state_dict(), weight_mapper)
```
🤖 Prompt for AI Agents
In docs/source/features/checkpoint-loading.md around lines 130 to 139, the
example code uses an undefined variable `hf_gemma3`. Replace `hf_gemma3` with a
properly defined Hugging Face model object or add a comment clarifying that it
represents the source HF model whose state_dict is being loaded. Ensure the
variable is defined or clearly explained to avoid confusion.

Comment on lines +3 to +12
- [Motivation](#Motivation)
- [KV Cache Exchange](#KV-Cache-Exchange)
- [Multi-backend Support](#Multi-backend-Support)
- [Overlap Optimization](#Overlap-Optimization)
- [Cache Layout Transformation](#Cache-Layout-Transformation)
- [Usage](#Usage)
- [trtllm-serve](#trtllm-serve)
- [Dynamo](#Dynamo)
- [Environment Variables](#Environment-Variables)
- [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
Copy link
Contributor

⚠️ Potential issue

Fix ToC anchor fragments (markdownlint MD051).

HTML IDs are lowercase-kebab-case. Update the fragment links to match generated anchors.

-- [Motivation](#Motivation)
-- [KV Cache Exchange](#KV-Cache-Exchange)
-  - [Multi-backend Support](#Multi-backend-Support)
-  - [Overlap Optimization](#Overlap-Optimization)
-  - [Cache Layout Transformation](#Cache-Layout-Transformation)
-- [Usage](#Usage)
-  - [trtllm-serve](#trtllm-serve)
-  - [Dynamo](#Dynamo)
-- [Environment Variables](#Environment-Variables)
-- [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
+- [Motivation](#motivation)
+- [KV Cache Exchange](#kv-cache-exchange)
+  - [Multi-backend Support](#multi-backend-support)
+  - [Overlap Optimization](#overlap-optimization)
+  - [Cache Layout Transformation](#cache-layout-transformation)
+- [Usage](#usage)
+  - [trtllm-serve](#trtllm-serve)
+  - [Dynamo](#dynamo)
+- [Environment Variables](#environment-variables)
+- [Troubleshooting and FAQ](#troubleshooting-and-faq)
-Please refer to the following section for details [Environment Variables](#Environment-Variables).
+Please refer to the following section for details [Environment Variables](#environment-variables).

Also applies to: 78-78

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

3-3: Link fragments should be valid
Expected: #motivation; Actual: #Motivation

(MD051, link-fragments)


4-4: Link fragments should be valid
Expected: #kv-cache-exchange; Actual: #KV-Cache-Exchange

(MD051, link-fragments)


5-5: Link fragments should be valid
Expected: #multi-backend-support; Actual: #Multi-backend-Support

(MD051, link-fragments)


6-6: Link fragments should be valid
Expected: #overlap-optimization; Actual: #Overlap-Optimization

(MD051, link-fragments)


7-7: Link fragments should be valid
Expected: #cache-layout-transformation; Actual: #Cache-Layout-Transformation

(MD051, link-fragments)


8-8: Link fragments should be valid
Expected: #usage; Actual: #Usage

(MD051, link-fragments)


10-10: Link fragments should be valid
Expected: #dynamo; Actual: #Dynamo

(MD051, link-fragments)


11-11: Link fragments should be valid
Expected: #environment-variables; Actual: #Environment-Variables

(MD051, link-fragments)


12-12: Link fragments should be valid
Expected: #troubleshooting-and-faq; Actual: #Troubleshooting-and-FAQ

(MD051, link-fragments)

🤖 Prompt for AI Agents
In docs/source/features/disagg-serving.md around lines 3 to 12 and line 78, the
Table of Contents anchor links use uppercase letters and camel case, which do
not match the lowercase-kebab-case format generated by markdown. Update all
anchor fragment links to lowercase and replace spaces with hyphens to conform to
lowercase-kebab-case, ensuring the links correctly navigate to the corresponding
sections.
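
For reference, the anchor ids generated by GitHub and most Markdown renderers can be approximated with a small helper (an approximation for illustration only; exact slug rules vary slightly between renderers):

```python
import re

def slugify(heading: str) -> str:
    # Approximate the auto-generated anchor id: lowercase the heading,
    # drop most punctuation, and turn spaces into hyphens.
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)
    return slug.replace(" ", "-")

print(slugify("KV Cache Exchange"))        # kv-cache-exchange
print(slugify("Troubleshooting and FAQ"))  # troubleshooting-and-faq
```

This is why a fragment such as `#KV-Cache-Exchange` never matches the generated `#kv-cache-exchange` id.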

Comment on lines +35 to +37
Quickly try out TensorRT-LLM's multimodal support using our `LLM-API` and a ready-to-run [example](/examples/llm-api/quickstart_multimodal.py):

```bash

🛠️ Refactor suggestion

Fix all absolute “/docs/…” links

Links like /examples/llm-api/quickstart_multimodal.py, /docs/source/commands/trtllm-serve.rst, and /docs/source/performance/perf-benchmarking.md will 404 once the site is built because Sphinx treats them as external URLs. Replace with:

:doc:`commands/trtllm-serve`
:doc:`performance/perf-benchmarking`

or plain relative paths without the leading “/”.

Also applies to: 41-47, 49-53

🤖 Prompt for AI Agents
In docs/source/features/multi-modality.md around lines 35 to 37, the links use
absolute paths starting with "/", which will cause 404 errors after the site
build because Sphinx treats them as external URLs. Replace these absolute links
with either the Sphinx :doc: directive for referencing documentation pages or
use relative paths without the leading "/". Also apply this fix to lines 41-47
and 49-53 where similar absolute links appear.


[Previously](./useful-build-time-flags.md#paged-context-attention) we recommended enabling paged context attention even though in our case study it didn't affect performance significantly. Now that we understand the TensorRT-LLM scheduler, we can explain why this is beneficial. In short, we recommend enabling it because it enables context chunking, which allows the context phase of a request to be broken up into pieces and processed over several execution iterations, allowing the engine to provide a more stable balance of context and generation phase execution.

The [visualization](#understanding-the-trt-llm-scheduler) of the TensorRT-LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.

⚠️ Potential issue

Broken local anchor.

The anchor “#understanding-the-trt-llm-scheduler” doesn’t exist. The section is “The schedulers”.

-The [visualization](#understanding-the-trt-llm-scheduler) of the TensorRT-LLM scheduler showed ...
+The [visualization](#the-schedulers) of the TensorRT-LLM scheduler showed ...
📝 Committable suggestion


Suggested change
The [visualization](#understanding-the-trt-llm-scheduler) of the TensorRT-LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
The [visualization](#the-schedulers) of the TensorRT-LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

141-141: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
In docs/source/features/paged-attention-ifb-scheduler.md at line 141, the link
anchor "#understanding-the-trt-llm-scheduler" is broken because the correct
section header is "The schedulers". Update the anchor in the link to match the
exact section title by changing it to "#the-schedulers" to fix the broken local
anchor.

Comment on lines +41 to +47
from tensorrt_llm.llmapi import DraftTargetDecodingConfig

speculative_config = DraftTargetDecodingConfig(
max_draft_len=3, speculative_model="/path/to/draft_model")

llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
```

🛠️ Refactor suggestion

Make code snippets copy-pasteable: add missing LLM import.

Each snippet uses LLM but doesn’t import it.

-from tensorrt_llm.llmapi import DraftTargetDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import DraftTargetDecodingConfig
-from tensorrt_llm.llmapi import EagleDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import EagleDecodingConfig
-from tensorrt_llm.llmapi import NGramDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import NGramDecodingConfig
-from tensorrt_llm.llmapi import MTPDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import MTPDecodingConfig
-from tensorrt_llm.llmapi import UserProvidedDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig

Also applies to: 59-69, 84-90, 105-111, 120-126

🤖 Prompt for AI Agents
In docs/source/features/speculative-decoding.md around lines 41 to 47, the code
snippet uses the LLM class without importing it, which prevents direct
copy-pasting. Add an import statement for LLM at the beginning of each affected
snippet, including lines 59-69, 84-90, 105-111, and 120-126, to ensure the code
is complete and runnable.
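
Put together, the draft/target example from the doc would read roughly as follows once the suggested import is added (the paths are the placeholder paths from the original snippet):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import DraftTargetDecodingConfig

# A small draft model proposes up to 3 tokens per step; the target model verifies them.
speculative_config = DraftTargetDecodingConfig(
    max_draft_len=3, speculative_model="/path/to/draft_model")

llm = LLM("/path/to/target_model",
          speculative_config=speculative_config,
          disable_overlap_scheduler=True)
```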

@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc branch from 1ff7743 to 050db34 Compare August 8, 2025 09:12

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (19)
docs/source/features/multi-modality.md (1)

27-53: Replace absolute paths with Sphinx cross-refs.

The links on Lines 27, 35, 41, 49, 51 and 53 still start with “/…”.
Sphinx treats these as external URLs → 404 after site build.
Use :doc: (for docs) or remove the leading slash.

Example:

-Please refer to the latest multimodal [support matrix](/docs/source/reference/support-matrix.md).
+Please refer to the latest multimodal :doc:`support matrix <reference/support-matrix>`.

Same fix applies to every absolute path listed above.

docs/source/developer-guide/perf-analysis.md (1)

81-88: Remove the backslash before the inline comment.

The backslash on Line 87 escapes the following space, so the comment becomes part of the command and breaks copy-paste execution.

-  trtllm-bench \ # or trtllm-serve command
+  # Replace with `trtllm-serve` if desired
+  trtllm-bench \
docs/source/features/paged-attention-ifb-scheduler.md (2)

76-79: Out-of-tree link still points to removed performance/ folder
Please update ../performance/perf-best-practices.md#chunked-context to the new location under developer-guide/.


141-141: Broken local anchor
#understanding-the-trt-llm-scheduler no longer exists – change to #the-schedulers.

docs/source/features/sampling.md (1)

74-80: Parameter name typo breaks copy-paste usage
`strucutral_tag` → `structural_tag`. Also switch list bullet to dash for MD004 compliance.

docs/source/features/disagg-serving.md (2)

3-12: Fix ToC fragment casing to match generated ids
All fragment links should be lowercase-kebab-case (markdownlint MD051).


78-78: Anchor in body still uses incorrect casing
`[Environment Variables](#Environment-Variables)` → `#environment-variables`.

docs/source/developer-guide/perf-benchmarking.md (2)

17-18: Dangling markdown link
[in-flight batching] still lacks a destination; add the target URL as done in prior suggestion.


117-118: input_ids are token IDs, not logits
Table description is still incorrect – replace “List of logits” with “Token IDs”.

docs/source/features/speculative-decoding.md (8)

58-69: Make the snippet copy-pasteable and clarify overlap scheduler flag

  • Add missing LLM import.
  • Suggest passing disable_overlap_scheduler = (not eagle3_one_model) to match the note above.
-from tensorrt_llm.llmapi import EagleDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import EagleDecodingConfig
@@
-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Only need to disable overlap scheduler if eagle3_one_model is False.
+llm = LLM(
+    "/path/to/target_model",
+    speculative_config=speculative_config,
+    disable_overlap_scheduler=(not eagle3_one_model),
+)

84-84: Fix incorrect import path (`llm_api` → `llmapi`)

Use the actual package path to avoid copy/paste import errors.

-from tensorrt_llm.llm_api import NGramDecodingConfig
+from tensorrt_llm.llmapi import NGramDecodingConfig

105-105: Fix incorrect import path (`llm_api` → `llmapi`)

Align with the real module path.

-from tensorrt_llm.llm_api import MTPDecodingConfig
+from tensorrt_llm.llmapi import MTPDecodingConfig

120-120: Fix incorrect import path (`llm_api` → `llmapi`)

Prevent import error in user-provided drafting snippet.

-from tensorrt_llm.llm_api import UserProvidedDecodingConfig
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig

40-47: Make the snippet copy-pasteable: add missing LLM import

Examples instantiate LLM but don’t import it.

-from tensorrt_llm.llmapi import DraftTargetDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import DraftTargetDecodingConfig

83-90: Make the NGram snippet copy-pasteable: add missing LLM import

Add LLM import so users can run the snippet as-is.

-from tensorrt_llm.llmapi import NGramDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import NGramDecodingConfig

104-111: Make the MTP snippet copy-pasteable: add missing LLM import

Add LLM import for a complete example.

-from tensorrt_llm.llmapi import MTPDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import MTPDecodingConfig

119-126: Make the user-provided drafting snippet copy-pasteable: add missing LLM import

Add LLM import for completeness.

-from tensorrt_llm.llmapi import UserProvidedDecodingConfig
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import UserProvidedDecodingConfig
docs/source/features/attention.md (1)

262-264: Fix invalid Sphinx link (build-breaking)

Replace custom source: scheme with a resolvable cross-reference.

-See the implementation of
-[`KVCacheManager`](source:tensorrt_llm/_torch/pyexecutor/resource_manager.py).
+See :py:class:`tensorrt_llm._torch.pyexecutor.resource_manager.KVCacheManager`
+for implementation details.
docs/source/features/checkpoint-loading.md (1)

135-139: Fix undefined variable in example (hf_gemma3)

Instantiate an HF model and use its state_dict().

-from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM
+from tensorrt_llm._torch.models.modeling_gemma3 import Gemma3ForCausalLM
+from transformers import AutoModelForCausalLM
@@
 gemma3 = Gemma3ForCausalLM(model_config)
 weight_mapper = Gemma3HfWeightMapper()
 weight_mapper.init_model_and_config(gemma3, model_config)
-gemma3.load_weights(hf_gemma3.state_dict(), weight_mapper)
+# Load the HF source model to obtain a state dict:
+hf_model = AutoModelForCausalLM.from_pretrained(hf_model_dir, torch_dtype="auto")
+gemma3.load_weights(hf_model.state_dict(), weight_mapper)
🧹 Nitpick comments (20)
docs/source/features/kvcache.md (1)

3-20: Trim overly long sentences for readability.

Several paragraphs chain multiple clauses and repeatedly use “in order to …”.
Consider splitting into shorter sentences and dropping the phrase for crisper docs.

Example:

-The KV cache stores previously computed key-value pairs for reuse during generation in order to avoid redundant calculations.
+The KV cache stores previously computed key-value pairs so generation can skip redundant calculations.

Same for Lines 7, 15 and 19.

docs/source/developer-guide/perf-analysis.md (1)

12-24: Fix list indentation (markdown-lint MD007).

Bullets under “The main functionality” and the following section are indented two spaces instead of zero, tripping MD007. Align to column 0 to satisfy markdown-lint.

docs/source/features/parallel-strategy.md (1)

51-54: Clarify terminology “Hybrid ETP”.

The label “Hybrid ETP” appears only here.
Consider renaming to “Hybrid (TP × EP)” for consistency with the explanatory sentence above.

docs/source/features/paged-attention-ifb-scheduler.md (1)

92-96: Add language identifier to fenced block
Markdown-lint MD040 fires because the code fence after “Contiguous KV cache is a monolithic tensor” has no language tag.

-```
+```text
 [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].

docs/source/features/disagg-serving.md (1)

`151-151`: **Spelling: “serfvers” → “servers”**  
Minor typo in explanatory sentence.

docs/source/developer-guide/perf-benchmarking.md (1)

`128-139`: **Specify language for JSON examples**  
Add `json` after the opening back-ticks to silence MD040 and improve rendering.

docs/source/features/speculative-decoding.md (2)

`235-239`: **Fix typo and grammar in verification description**

Correct “drat” → “draft”; add missing pronoun for clarity.


```diff
-Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
-matches the previously decoded token exactly. For example, suppose there is a generation request
-[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are drat tokens. Suppose the token after `t` is `d1`
+Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
+it matches the previously decoded token exactly. For example, suppose there is a generation request
+`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are draft tokens. Suppose the token after `t` is `d1`
```
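
The acceptance rule being described amounts to a simple prefix match; a toy sketch of it (illustrative only, not the TensorRT-LLM implementation):

```python
def accept_draft_tokens(target_tokens, draft_tokens):
    # Toy sketch of greedy verification: keep drafts only while each one equals
    # the token the target model decodes at that position; stop at the first mismatch.
    accepted = []
    for target, draft in zip(target_tokens, draft_tokens):
        if target != draft:
            break
        accepted.append(draft)
    return accepted

# With drafts [d1, d2, d3] and target continuation [d1, x, ...], only d1 is accepted.
```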

11-11: Minor style: “all of the” → “all the”

Concise phrasing reads better.

-The table below enumerates all of the algorithm/model combinations that are supported.
+The table below enumerates all the algorithm/model combinations that are supported.
docs/source/features/attention.md (5)

189-191: Fix unordered list indentation (markdownlint MD007)

Remove the extra leading space to match expected indentation.

- - FP16 / BF16 compute data type.
- - FP16 / BF16 / FP8 / INT8 KV cache data type.
- - Paged KV cache (8 / 16 / 32 / 64 / 128 tokens per block).
+- FP16 / BF16 compute data type.
+- FP16 / BF16 / FP8 / INT8 KV cache data type.
+- Paged KV cache (8 / 16 / 32 / 64 / 128 tokens per block).

248-251: Add language to fenced code block (markdownlint MD040)

Specify a language (use text) for the shape snippet.

-```
+```text
 [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].

---

`382-385`: **Add language to fenced code block (markdownlint MD040)**

Specify `text` for the formula snippet.


```diff
-```
+```text
 norm_factor = 1.f / (q_scaling * sqrt(head_size)).

---

`219-220`: **Consistent emphasis style (markdownlint MD049)**

Use asterisks for emphasis to match style checks.


```diff
-_(1) Padding sequences in the generation phase, that contain a single token, to
+*(1) Padding sequences in the generation phase, that contain a single token, to
 the length of the maximum input sequence is inefficient use of resources_.

68-68: Minor style: “outside of the backend” → “outside the backend”

Concise phrasing.

-| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside of the backend. |
+| position_ids | Optional[Tensor] | The position of each token in each sequence. May be None if positional embedding is applied outside the backend. |
docs/source/features/checkpoint-loading.md (1)

326-326: Fix typo: “asscoiated” → “associated”

-By setting the model name, the registered mapper will be asscoiated with the specific model.
+By setting the model name, the registered mapper will be associated with the specific model.
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (6)

1-1: Use consistent model naming in the title

Prefer “Llama 3.3-70B” (with space and hyphen) for consistency with the rest of the doc and HF model card naming.

-# Quick Start Recipe for Llama3.3 70B on TensorRT-LLM - Blackwell & Hopper Hardware
+# Quick Start Recipe for Llama 3.3-70B on TensorRT-LLM - Blackwell & Hopper

27-27: Punctuation nit: end the NVFP4 support note with a period

Tiny clarity/consistency fix.

-Note that NVFP4 is only supported on NVIDIA Blackwell
+Note that NVFP4 is only supported on NVIDIA Blackwell.

255-255: Fix typo and improve wording (LLama → Llama, clarity on note)

Minor correctness and clarity improvement.

-* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instruction model. So set add\_special\_tokens=False to avoid it.
+* Note: The tokenizer adds a BOS token before the input prompt by default, which leads to accuracy regression on the GSM8K task for Llama 3.3 70B Instruct. Set `add_special_tokens=False` to avoid it.

274-274: Grammar tweak in benchmarking intro sentence

Small readability fix.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in “benchmark\_serving.py” script. To do this first creating a wrapper [bench.sh](http://bench.sh) script.
+To benchmark your TensorRT-LLM server, use the built-in “benchmark_serving.py” script. First, create a small wrapper script (bench.sh).

277-283: Result directory is defined but not used in the initial benchmark loop

Either remove result_dir here or add the --save-result flags in the loop to avoid confusion. Keeping the “save results” flags in the main example helps users preserve measurements by default.

Proposed change inside the loop (after the --percentile-metrics line):

         --percentile-metrics "ttft,tpot,itl,e2el"
+        --save-result \
+        --result-dir "${result_dir}" \
+        --result-filename "concurrency_${concurrency}.json"

Alternatively, remove the result_dir variable from the header if you prefer the current minimal example.


110-130: Style nit: repeated “Description” sentence starts

Not blocking, but you can vary sentence openings slightly to avoid repetition flagged by linters. No functional impact.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ff7743 and 050db34.

⛔ Files ignored due to path filters (2)
  • docs/source/features/media/feat_long_seq_chunked_attention.png is excluded by !**/*.png
  • docs/source/features/media/feat_long_seq_sliding_win_attn.png is excluded by !**/*.png
📒 Files selected for processing (30)
  • docs/source/commands/trtllm-eval.rst (1 hunks)
  • docs/source/conf.py (1 hunks)
  • docs/source/deployment-guide/index.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (3 hunks)
  • docs/source/developer-guide/perf-analysis.md (1 hunks)
  • docs/source/developer-guide/perf-benchmarking.md (1 hunks)
  • docs/source/examples/dynamo_k8s_example.rst (1 hunks)
  • docs/source/features/attention.md (1 hunks)
  • docs/source/features/checkpoint-loading.md (1 hunks)
  • docs/source/features/disagg-serving.md (1 hunks)
  • docs/source/features/feature-combination-matrix.md (1 hunks)
  • docs/source/features/kvcache.md (1 hunks)
  • docs/source/features/long-sequence.md (1 hunks)
  • docs/source/features/lora.md (1 hunks)
  • docs/source/features/multi-modality.md (1 hunks)
  • docs/source/features/overlap-scheduler.md (1 hunks)
  • docs/source/features/paged-attention-ifb-scheduler.md (1 hunks)
  • docs/source/features/parallel-strategy.md (1 hunks)
  • docs/source/features/quantization.md (1 hunks)
  • docs/source/features/sampling.md (1 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/installation/index.rst (1 hunks)
  • docs/source/models/adding-new-model.md (1 hunks)
  • docs/source/models/supported-models.md (1 hunks)
  • docs/source/overview.md (2 hunks)
  • docs/source/quick-start-guide.md (3 hunks)
  • docs/source/reference/support-matrix.md (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • docs/source/deployment-guide/index.rst
  • docs/source/conf.py
  • docs/source/features/feature-combination-matrix.md
  • docs/source/commands/trtllm-eval.rst
🚧 Files skipped from review as they are similar to previous changes (13)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/reference/support-matrix.md
  • docs/source/installation/index.rst
  • docs/source/features/overlap-scheduler.md
  • docs/source/examples/dynamo_k8s_example.rst
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/quick-start-guide.md
  • docs/source/index.rst
  • docs/source/overview.md
  • docs/source/models/supported-models.md
  • docs/source/models/adding-new-model.md
  • docs/source/features/quantization.md
  • docs/source/features/lora.md
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/features/parallel-strategy.md
  • docs/source/features/long-sequence.md
  • docs/source/features/speculative-decoding.md
  • docs/source/features/attention.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/multi-modality.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/features/parallel-strategy.md
  • docs/source/features/checkpoint-loading.md
  • docs/source/features/speculative-decoding.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/developer-guide/perf-benchmarking.md
  • docs/source/features/disagg-serving.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/features/parallel-strategy.md
  • docs/source/features/long-sequence.md
  • docs/source/features/speculative-decoding.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
  • docs/source/developer-guide/perf-analysis.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • docs/source/features/parallel-strategy.md
  • docs/source/features/paged-attention-ifb-scheduler.md
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/features/parallel-strategy.md
  • docs/source/features/paged-attention-ifb-scheduler.md
  • docs/source/features/long-sequence.md
  • docs/source/features/speculative-decoding.md
  • docs/source/features/kvcache.md
  • docs/source/features/multi-modality.md
  • docs/source/features/sampling.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/features/speculative-decoding.md
  • docs/source/features/disagg-serving.md
🪛 LanguageTool
docs/source/features/paged-attention-ifb-scheduler.md

[style] ~123-~123: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...x num token budget of 12, the scheduler is able to schedule two additional requests, Reque...

(BE_ABLE_TO)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...support. For long-context cases this is extremely important, because setting extremely large values...

(EN_WEAK_ADJECTIVE)


[style] ~145-~145: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase.
Context: ...is extremely important, because setting extremely large values of max-num tokens takes away fro...

(EN_WEAK_ADJECTIVE)

docs/source/features/checkpoint-loading.md

[style] ~146-~146: Consider using a different adverb to strengthen your wording.
Context: ...ent all four components when supporting a completely new checkpoint format - **Custom Weight...

(COMPLETELY_ENTIRELY)


[grammar] ~326-~326: Ensure spelling is correct
Context: ...del name, the registered mapper will be asscoiated with the specific model.

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/features/speculative-decoding.md

[style] ~11-~11: Consider removing “of” to be more concise
Context: ...lementation. The table below enumerates all of the algorithm/model combinations that are s...

(ALL_OF_THE)


[style] ~94-~94: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s currently only supported by Deepseek. MTP can be tuned with the following configu...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~149-~149: Consider removing “of” to be more concise
Context: ...ts of a speculative decoding algorithm. All of the interfaces are defined in [`_torch/spec...

(ALL_OF_THE)


[style] ~161-~161: Consider removing “of” to be more concise
Context: ... is problematic. Ideally, we would move all of the kernel dispatching logic to a lower lev...

(ALL_OF_THE)


[style] ~187-~187: Consider removing “of” to be more concise
Context: ...et model as a submodule. 7. Drafter: All of the logic required to actually produce draf...

(ALL_OF_THE)


[grammar] ~237-~237: Ensure spelling is correct
Context: ...2, d3], where d1, d2, and d3are drat tokens. Suppose the token aftert` is ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

docs/source/features/attention.md

[style] ~68-~68: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...osition of each token in each sequence. May be None if positional embedding is appl...

(MISSING_IT_THERE)


[style] ~68-~68: This phrase is redundant. Consider using “outside”.
Context: ...None if positional embedding is applied outside of the backend. | | request_ids | List[int...

(OUTSIDE_OF)


[style] ~131-~131: ‘in proportion to’ might be wordy. Consider a shorter alternative.
Context: ...int is significant (grows quadratically in proportion to the sequence length). The `TrtllmAtten...

(EN_WORDINESS_PREMIUM_IN_PROPORTION_TO)


[style] ~158-~158: As a shorter alternative for ‘able to’, consider using “can”.
Context: ... attention in TensorRT-LLM. That kernel is able to apply pre-processing on the Q, K, and V...

(BE_ABLE_TO)


[style] ~168-~168: To elevate your writing, try using a synonym here.
Context: ...efinition of 'small' in that context is hard to quantify because it depends on the m...

(HARD_TO)


[style] ~254-~254: Consider using a synonym to be more concise.
Context: ...tion of many output tokens, it may take a lot of steps to reach that point). #### Paged...

(A_LOT_OF)


[style] ~312-~312: Consider removing “of” to be more concise
Context: ...implementation of StreamingLLM. It uses all of the tokens for computation and only saves `...

(ALL_OF_THE)

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

[style] ~115-~115: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~123-~123: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~127-~127: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~142-~142: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~146-~146: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~156-~156: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~160-~160: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~162-~162: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~172-~172: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~178-~178: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [TorchLlmA...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~303-~303: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

docs/source/developer-guide/perf-analysis.md

[style] ~69-~69: In American English, abbreviations like “etc.” require a period.
Context: ...g, such as GIL, debugging NVTX markers, etc: ```bash #!/bin/bash # Prepare datase...

(ETC_PERIOD)

docs/source/developer-guide/perf-benchmarking.md

[style] ~13-~13: Consider using a more concise synonym.
Context: ...uild tuned engines for benchmarking for a variety of models and platforms. - An entirely Pyt...

(A_VARIETY_OF)


[style] ~147-~147: Consider a more concise word here.
Context: ...plete JSON entry is on every line. ``` In order to prepare a synthetic dataset, you can us...

(IN_ORDER_TO_PREMIUM)


[style] ~325-~325: ‘due to the fact that’ might be wordy. Consider a shorter alternative.
Context: ...re texts and not tokenized ids. This is due to the fact that the prompt and the media (image/video...

(EN_WORDINESS_PREMIUM_DUE_TO_THE_FACT_THAT)

docs/source/features/disagg-serving.md

[grammar] ~151-~151: Ensure spelling is correct
Context: ...when routing requests to the generation serfvers, the disaggregated server will mark the...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~202-~202: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. * `TRTLLM_KVCACHE_SEND_MAX_CONCURRE...

(MORE_THAN_EXCEEDS)


[style] ~204-~204: This phrasing could be wordy, so try replacing it with something more concise.
Context: ...n TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0. There are some other useful environ...

(MORE_THAN_EXCEEDS)


[style] ~208-~208: Consider using a different verb to strengthen your wording.
Context: ... the number of CUDA streams, but please make sure there are no other NCCL ops outside the...

(MAKE_SURE_ENSURE)

docs/source/features/kvcache.md

[style] ~3-~3: Consider a more concise word here.
Context: ...value pairs for reuse during generation in order to avoid redundant calculations. The Tenso...

(IN_ORDER_TO_PREMIUM)


[style] ~7-~7: Consider a more concise word here.
Context: ...tention window size and number of heads in order to support variable attention window size ...

(IN_ORDER_TO_PREMIUM)


[style] ~15-~15: Consider a more concise word here.
Context: ...yers with limited attention window size in order to reduce computations and memory usage. B...

(IN_ORDER_TO_PREMIUM)


[style] ~19-~19: Consider a more concise word here.
Context: ...es advantage of grouped query attention in order to save memory. KV cache will create block...

(IN_ORDER_TO_PREMIUM)

docs/source/features/multi-modality.md

[style] ~3-~3: Consider using a more concise synonym.
Context: ... in TensorRT-LLM TensorRT-LLM supports a variety of multimodal models, enabling efficient i...

(A_VARIETY_OF)

🪛 markdownlint-cli2 (0.17.2)
docs/source/features/paged-attention-ifb-scheduler.md

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


141-141: Link fragments should be valid

(MD051, link-fragments)

docs/source/features/attention.md

189-189: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


190-190: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


191-191: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


219-219: Emphasis style
Expected: asterisk; Actual: underscore

(MD049, emphasis-style)


220-220: Emphasis style
Expected: asterisk; Actual: underscore

(MD049, emphasis-style)


248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


382-382: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/source/developer-guide/perf-analysis.md

13-13: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


14-14: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


15-15: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


18-18: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


19-19: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


22-22: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


23-23: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)

docs/source/developer-guide/perf-benchmarking.md

159-159: Link fragments should be valid

(MD051, link-fragments)


234-234: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


273-273: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


293-293: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


313-313: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


329-329: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


337-337: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


349-349: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/source/features/sampling.md

78-78: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

175-181: Add cross-link to the attention backend guide

The section references attention backends; adding the dedicated guide improves discoverability. The AI summary also mentions this linkage.

 See the [TorchLlmArgs](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) class for the full list of options which can be used in the `extra_llm_api_options`.
+For details on attention implementations and trade-offs, see the [Attention backends guide](../features/attention.md).

Likely an incorrect or invalid review comment.

Comment on lines +46 to +51
Note:

* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
* The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host
* See the [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.

🛠️ Refactor suggestion

Include Hugging Face token/credentials guidance for gated models

Without passing a token or mounting credentials, first-time downloads of Llama 3.3 are likely to fail. Add explicit steps here.

 Note: 
 * You can mount additional directories and paths using the \-v \<local_path>:\<path> flag if needed, such as mounting the downloaded weight paths.  
 * The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to ~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the ~/.cache directory doesn’t exist please create it using  mkdir ~/.cache  
 * The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host  
 * See the https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
+* If the model is gated on Hugging Face, provide your credentials: either run `huggingface-cli login` inside the container or pass your token via `-e HUGGING_FACE_HUB_TOKEN=...`. To persist credentials, mount your HF config: `-v ~/.huggingface:/root/.huggingface:ro`.
+* For least-privilege mounting, consider binding only `~/.cache/huggingface` instead of the entire `~/.cache`.
📝 Committable suggestion


Suggested change
Note:
* You can mount additional directories and paths using the \-v \<local\_path\>:\<path\> flag if needed, such as mounting the downloaded weight paths.
* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to \~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the \~/.cache directory doesn’t exist please create it using mkdir \~/.cache
* The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host
* See the [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
Note:
* You can mount additional directories and paths using the \-v \<local_path>:\<path> flag if needed, such as mounting the downloaded weight paths.
* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to ~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the ~/.cache directory doesn’t exist please create it using mkdir ~/.cache
* The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host
* See the https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
* If the model is gated on Hugging Face, provide your credentials: either run `huggingface-cli login` inside the container or pass your token via `-e HUGGING_FACE_HUB_TOKEN=...`. To persist credentials, mount your HF config: `-v ~/.huggingface:/root/.huggingface:ro`.
* For least-privilege mounting, consider binding only `~/.cache/huggingface` instead of the entire `~/.cache`.
🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
around lines 46 to 51, the instructions lack guidance on providing Hugging Face
tokens or credentials needed for gated model downloads. Add explicit steps
explaining how to obtain a Hugging Face access token and how to pass it to the
container, either by setting environment variables or mounting credential files,
to ensure first-time downloads of Llama 3.3 succeed without authentication
errors.
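
As a complement to the token/environment-variable suggestion above, authentication can also be done programmatically inside the container (a sketch; it assumes the `HUGGING_FACE_HUB_TOKEN` variable from the suggested `docker run -e ...` flag is set):

```python
import os

from huggingface_hub import login

# Authenticate once so gated checkpoints such as Llama 3.3 can be downloaded.
# Assumes HUGGING_FACE_HUB_TOKEN was passed into the container rather than hard-coded.
login(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
```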

Comment on lines +76 to +89
```shell
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 1024 \
--max_num_tokens 2048 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

💡 Verification agent

🧩 Analysis chain

Call out realistic TP sizing for 70B on Hopper vs. Blackwell

70B typically does not fit on a single Hopper GPU. Add a short inline note to prevent OOM/confusion for first-time users.

 ```shell
+# IMPORTANT: For 70B on Hopper, set --tp_size to your GPU count (e.g., 8 on 8xH100 80GB).
+# On Blackwell B200 192GB, --tp_size=1 may be feasible depending on settings.
 trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
     --host 0.0.0.0 \
     --port 8000 \
     --backend pytorch \
     --max_batch_size 1024 \
     --max_num_tokens 2048 \
     --max_seq_len 2048 \
     --kv_cache_free_gpu_memory_fraction 0.9 \
     --tp_size 1 \
     --ep_size 1 \
     --trust_remote_code \
     --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

Additionally, consider adding a brief memory sizing table (H100 vs. H200 vs. B200) in a follow-up PR.

---

🌐 Web query:

```web
NVIDIA Hopper H100 80GB GPU memory capacity and NVIDIA Blackwell B200 GPU memory capacity
```

💡 Result:

The NVIDIA Hopper H100 GPU has a memory capacity of 80 GB of HBM3 memory[1][3][4][6][7][9], while the NVIDIA Blackwell B200 GPU features a much larger memory capacity of 192 GB of HBM3e memory[2][3][5][6][8][9].

Essential Details:

  • NVIDIA H100 (Hopper)

    • Memory Capacity: 80 GB HBM3 (SXM/PCIe models)[1][3][4][6][7][9]
    • Memory Bandwidth: 3.2–3.35 TB/s[1][3][6][7]
    • Available in both SXM and PCIe form factors, both providing 80 GB[3][4][7].
    • There are select H100 NVL models with 94 GB HBM3, but the standard and most widely referenced configuration is 80 GB[1].
  • NVIDIA B200 (Blackwell)

    • Memory Capacity: 192 GB HBM3e[2][3][5][6][8][9]
    • Memory Bandwidth: 8 TB/s[2][3][6]
    • The 192 GB is achieved by using two Blackwell chiplets, each with 96 GB (4× 24 GB HBM3e per chiplet)[2].
    • This capacity is more than double the H100's, and B200’s architecture is optimized for ultra-large model training and inference without needing to offload data to slower system memory[2][3][9].

Comparison Table:

| GPU Model | Memory Capacity | Memory Type | Memory Bandwidth |
|-----------|-----------------|-------------|------------------|
| NVIDIA H100 | 80 GB (HBM3) | HBM3 | 3.35 TB/s |
| NVIDIA B200 | 192 GB (HBM3e) | HBM3e | 8 TB/s |
  • The B200's memory size and bandwidth represent a major leap over the H100, supporting much larger models and datasets directly in GPU memory, minimizing data transfer bottlenecks[2][3][6][9].

Other Notes:

  • For reference, the intermediate NVIDIA H200 (Hopper refresh) bridges the gap with 141 GB HBM3e, but still offers less capacity and bandwidth than the Blackwell B200[6][9].
  • NVIDIA DGX B200 systems, with 8 B200 GPUs, deliver 1.4 TB of aggregate GPU memory—this is 8 × 192 GB, confirming the per-GPU capacity[8].

In summary, the NVIDIA B200’s 192 GB HBM3e memory far exceeds the H100’s 80 GB HBM3, marking a significant generational upgrade for workloads requiring vast on-device memory[2][3][6][8][9].



Add realistic TP sizing guidance for 70B on H100 vs. B200

To help first-time users avoid OOMs, insert a short inline note before the trtllm-serve command:

 ```shell
+# IMPORTANT: 70B models won’t fit on a single H100 (80 GB). Use tensor parallel across your GPUs (e.g., `--tp_size 8` on 8×H100 80 GB).
+# On a single Blackwell B200 (192 GB), `--tp_size=1` is typically sufficient.
 trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
     --host 0.0.0.0 \
     --port 8000 \
     --backend pytorch \
     --max_batch_size 1024 \
     --max_num_tokens 2048 \
     --max_seq_len 2048 \
     --kv_cache_free_gpu_memory_fraction 0.9 \
     --tp_size 1 \
     --ep_size 1 \
     --trust_remote_code \
     --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

• Consider adding a brief GPU memory sizing table (H100 80 GB vs. B200 192 GB) in a follow-up PR.
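
A rough back-of-envelope check of why this matters (assumed arithmetic, not measured numbers):

```python
params = 70e9                  # Llama 3.3-70B parameter count (assumed round figure)
bytes_per_param = 1            # FP8 stores roughly one byte per parameter
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~70 GB before KV cache and activations

# An 80 GB H100 leaves almost no headroom, hence tensor parallelism across GPUs;
# a 192 GB B200 fits the weights with room to spare for the KV cache.
```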

🤖 Prompt for AI Agents

In docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
around lines 76 to 89, add a short inline comment before the trtllm-serve
command explaining realistic tensor parallel (TP) sizing for 70B models on
different GPUs. Specifically, note that 70B models won't fit on a single H100
80GB GPU and recommend using tensor parallelism (e.g., --tp_size 8 for 8×H100
80GB), while on a single Blackwell B200 192GB GPU, --tp_size=1 is usually
sufficient. This guidance will help users avoid out-of-memory errors.




```shell
docker exec -it tensorrt_llm /bin/bash

pip install lm_eval
```

⚠️ Potential issue

Fix installation command for the LM evaluation harness

The PyPI package is “lm-eval”; using “lm_eval” may fail.

-pip install lm_eval
+pip install -U lm-eval
📝 Committable suggestion


Suggested change
pip install lm_eval
pip install -U lm-eval
🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
at line 231, the installation command uses the incorrect package name "lm_eval".
Replace "pip install lm_eval" with "pip install lm-eval" to correctly install
the LM evaluation harness from PyPI.

Comment on lines +62 to +71
```python
from tensorrt_llm._torch.models.checkpoints.base_weight_loader import BaseWeightLoader

class CustomWeightLoader(BaseWeightLoader):
def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
# Load weights from your custom format
# Return a dictionary mapping parameter names to tensors
return weights_dict
```


⚠️ Potential issue

Undefined variable weights_dict in minimal loader snippet

Return a defined variable (align with the later, fuller example).

-class CustomWeightLoader(BaseWeightLoader):
-    def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
-        # Load weights from your custom format
-        # Return a dictionary mapping parameter names to tensors
-        return weights_dict
+class CustomWeightLoader(BaseWeightLoader):
+    def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
+        # Load weights from your custom format
+        # Return a dictionary mapping parameter names to tensors
+        weights: dict[str, Any] = {}
+        return weights
📝 Committable suggestion


Suggested change
```python
from tensorrt_llm._torch.models.checkpoints.base_weight_loader import BaseWeightLoader

class CustomWeightLoader(BaseWeightLoader):
    def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
        # Load weights from your custom format
        # Return a dictionary mapping parameter names to tensors
        return weights_dict
```
```python
from tensorrt_llm._torch.models.checkpoints.base_weight_loader import BaseWeightLoader

class CustomWeightLoader(BaseWeightLoader):
    def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
        # Load weights from your custom format
        # Return a dictionary mapping parameter names to tensors
        weights: dict[str, Any] = {}
        return weights
```
🤖 Prompt for AI Agents
In docs/source/features/checkpoint-loading.md around lines 62 to 71, the example
code returns an undefined variable weights_dict in the load_weights method.
Define weights_dict before returning it, for example by initializing it as an
empty dictionary or populating it with sample data, to align with the fuller
example later in the document.
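
For a slightly fuller illustration, a custom loader that reads ordinary `.safetensors` shards could look like this (a sketch under assumptions: the class name is hypothetical, and `BaseWeightLoader` is assumed to require only `load_weights`):

```python
from pathlib import Path
from typing import Any

from safetensors.torch import load_file

from tensorrt_llm._torch.models.checkpoints.base_weight_loader import BaseWeightLoader


class SafetensorsWeightLoader(BaseWeightLoader):
    # Hypothetical loader name; assumes BaseWeightLoader only requires load_weights().
    def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
        # Merge every parameter tensor found in the directory's .safetensors shards
        # into a single name -> tensor dictionary.
        weights: dict[str, Any] = {}
        for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
            weights.update(load_file(str(shard)))
        return weights
```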

Comment on lines +28 to +48
<figure>
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/feat/1.0_doc_dev/docs/source/1.0/features/media/feat_long_seq_chunked_attention.png" alt="feat_long_seq_chunked_attention" width="240" height="auto">
</figure>
</div>
<p align="center"><sub><em>Figure 1. Illustration of chunked attention </em></sub></p>

Instead of splitting the input tokens into smaller chunks for the whole model, chunked attention is another method that is only applied to the attention layers in models.

With chunked attention, the tokens in context requests are split into chunks of a specified size. Then tokens can only attend to other tokens in the same chunk. For example, if the chunk size is 3, we might have a mask illustrated in Figure 1. Each token only needs to attend to at most the past chunk-sized tokens. As a result, both the KV cache size and the attention computation can be significantly reduced.

Currently TensorRT-LLM can only support chunked attention in llama4 model with TRTLLM attention backend. TensorRT-LLM will read `attention_chunk_size` from the model config. If it is not None, the chunked attention will be enabled with chunk size `attention_chunk_size`. If you want to enable chunked attention to other models, you can set the `attention_chunk_size` in attention API to a valid value.

Note that chunked attention can only be applied to context requests.

## Sliding Window Attention

<div align="center">
<figure>
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/feat/1.0_doc_dev/docs/source/1.0/features/media/feat_long_seq_sliding_win_attn.png" alt="feat_long_seq_sliding_win_attn" width="240" height="auto">
</figure>
</div>

⚠️ Potential issue

Use repository-relative image paths, not branch-specific URLs.

The <img src> links point to .../raw/feat/1.0_doc_dev/.... Once merged, that branch path vanishes and the images 404.
Store the images inside docs/source/…/media/ and reference them relatively:

<img src="../media/feat_long_seq_chunked_attention.png" ...>

Same fix for the sliding-window figure.

🤖 Prompt for AI Agents
In docs/source/features/long-sequence.md around lines 28 to 48, the image src
attributes use branch-specific URLs that will break after merging. Replace the
full GitHub URLs with relative paths pointing to the images inside the
docs/source/features/media/ directory, for example, change the src to
"../media/feat_long_seq_chunked_attention.png" for the chunked attention figure
and similarly update the sliding window attention figure's src to a relative
path.
