Merged

Changes from all commits (22 commits):
- `6e92c90` fix: Fix/fused moe 0.19 (#3799) (byshiue, Apr 24, 2025)
- `2554823` fix: Add pre-download of checkpoint before benchmark. (#3772) (FrankD412, Apr 25, 2025)
- `89ea617` [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding wi… (syuoni, Apr 25, 2025)
- `377cee8` TRTLLM-4875 feat: Add version switcher to doc (#3871) (kaiyux, Apr 25, 2025)
- `474b964` waive a test (#3897) (Superjomn, Apr 28, 2025)
- `8284c1f` docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939) (nv-guomingz, Apr 29, 2025)
- `9d31e8f` fix: remote mpi session abort (#3884) (Superjomn, Apr 29, 2025)
- `302ce1f` skip fp8 gemm for pre-hopper (#3931) (crazydemo, Apr 30, 2025)
- `b544308` [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with ove… (syuoni, May 1, 2025)
- `10d2231` Doc: Fix H200 DeepSeek R1 perf doc (#4006) (jiahanc, May 2, 2025)
- `85162d7` Fix the perf regression caused by insufficient cache warmup. (#4042) (hyukn, May 3, 2025)
- `6b47a4f` doc: Update 0.19.0 release notes (#3976) (kaiyux, May 6, 2025)
- `4aa4098` Optimize the AutoTuner cache access code to reduce host code overhead… (hyukn, May 6, 2025)
- `dd1bd3c` Update switcher (#4098) (kaiyux, May 6, 2025)
- `710535f` doc: update release notes (#4108) (kaiyux, May 7, 2025)
- `4afc1ce` docs:update 0.19 doc. (#4120) (nv-guomingz, May 7, 2025)
- `a478649` docs:add torch flow supported model list. (#4129) (nv-guomingz, May 8, 2025)
- `e3aff8b` doc: Release V0.19 Perf Overview Update (#4166) (zbpatel, May 9, 2025)
- `2d966da` Fix readme of autodeploy. (dcampora, May 13, 2025)
- `cbdcbc3` Update tensorrt_llm/_torch/pyexecutor/llm_request.py (dcampora, May 14, 2025)
- `db6f3b3` Revert mgmn worker node. (dcampora, May 14, 2025)
- `5d3fa62` Change to disable_overlap_scheduler. (dcampora, May 16, 2025)
16 changes: 16 additions & 0 deletions docs/source/_static/switcher.json
@@ -3,5 +3,21 @@
"preferred": true,
"version": "latest",
"url": "https://nvidia.github.io/TensorRT-LLM/latest"
},
{
"version": "0.19.0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.19.0"
},
{
"version": "0.20.0rc0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.20.0rc0"
},
{
"version": "0.19.0rc0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.19.0rc0"
},
{
"version": "0.18.2",
"url": "https://nvidia.github.io/TensorRT-LLM/0.18.2"
}
]
6 changes: 3 additions & 3 deletions docs/source/architecture/core-concepts.md
@@ -168,16 +168,16 @@ As a result, even if TensorRT has a powerful pattern-matching algorithm and
supports a lot of possible fusions, there is always the risk that it cannot
identify uncommon and/or very advanced patterns. To overcome that inevitable
limitation, TensorRT offers a powerful mechanism known as
- [plugins](https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Plugin/pyPlugin.html).
+ [plugins](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/python-api/infer/Plugin/pyPlugin.html).

The plugins are nodes inserted in the network graph definition that map to user-defined
GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in
the [`cpp/tensorrt_llm/plugins`](source:/cpp/tensorrt_llm/plugins) directory.

Plugins are written in C++ and follow a well-defined interface described in the
- [Extending TensorRT with Custom Layers](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#extending)
+ [Extending TensorRT with Custom Layers](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/extending-custom-layers.html)
section of the TensorRT
- [Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html).
+ [Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/latest/index.html).
When executed within a TensorRT engine, plugins trigger the execution of
their encapsulated GPU kernels. A fairly simple example of plugins is the
[`QuantizeTensorPlugin`](source:/cpp/tensorrt_llm/plugins/quantizeTensorPlugin) that
@@ -325,7 +325,7 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 \
--dataset $YOUR_DATA_PATH \
--backend pytorch \
--max_batch_size 128 \
- --max_num_tokens 1127 \
+ --max_num_tokens 1151 \
--num_requests 5120 \
--concurrency 1024 \
--kv_cache_free_gpu_mem_fraction 0.8 \
@@ -339,13 +339,13 @@ The perf might be different from different datasets and machines
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
- Request Throughput (req/sec): 5.1532
- Total Output Throughput (tokens/sec): 10553.8445
- Per User Output Throughput (tokens/sec/user): 10.4199
- Per GPU Output Throughput (tokens/sec/gpu): 1319.2306
- Total Token Throughput (tokens/sec): 15707.0888
- Total Latency (ms): 993548.8470
- Average request latency (ms): 197768.0434
+ Request Throughput (req/sec): 5.6100
+ Total Output Throughput (tokens/sec): 11489.2671
+ Per User Output Throughput (tokens/sec/user): 11.3476
+ Per GPU Output Throughput (tokens/sec/gpu): 1436.1584
+ Total Token Throughput (tokens/sec): 17233.9007
+ Total Latency (ms): 912656.9938
+ Average request latency (ms): 181540.5739
```
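
As a quick sanity check, the updated summary metrics are internally consistent. The short sketch below is a hedged reconstruction: it assumes the 8-GPU DeepSeek-R1 run and the 1024/2048 ISL/OSL workload used in this guide, and re-derives several rows of the table from the request throughput alone.

```python
# Sanity-check the relationships between the trtllm-bench summary metrics.
# Assumptions: 8 GPUs; ISL/OSL of 1024/2048; figures copied from the table above.
num_gpus = 8
num_requests = 5120
req_tput = 5.6100            # Request Throughput (req/sec)
total_out_tput = 11489.2671  # Total Output Throughput (tokens/sec)
total_tok_tput = 17233.9007  # Total Token Throughput (tokens/sec)

print(total_out_tput / num_gpus)                     # ~1436.16 tokens/sec/gpu (Per GPU Output Throughput)
print(total_out_tput / req_tput)                     # ~2048 tokens/request (the OSL)
print((total_tok_tput - total_out_tput) / req_tput)  # ~1024 tokens/request (the ISL)
print(num_requests / req_tput * 1000)                # ~912656 ms (Total Latency)
```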

## Exploring more ISL/OSL combinations
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -87,6 +87,7 @@

html_theme = 'nvidia_sphinx_theme'
html_static_path = ['_static']
html_extra_path = ["./_static/switcher.json"]
html_theme_options = {
"switcher": {
"json_url": "./_static/switcher.json",
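
The `conf.py` hunk above is collapsed below `json_url`, so here is a minimal sketch of how the version switcher is presumably wired end to end. The `version_match` line and the use of Sphinx's standard `release` variable are assumptions; they fall outside the visible diff.

```python
# Sketch of the assumed version-switcher wiring in docs/source/conf.py.
# Only the html_extra_path line is added by this PR; the rest is
# reconstructed context, not a verbatim copy of the file.
html_theme = 'nvidia_sphinx_theme'
html_static_path = ['_static']
html_extra_path = ["./_static/switcher.json"]  # publishes switcher.json at the site root
html_theme_options = {
    "switcher": {
        "json_url": "./_static/switcher.json",
        # Assumed: ties this build to one "version" entry in switcher.json.
        "version_match": release,
    },
    # ...remaining theme options are collapsed in the diff...
}
```

Publishing the JSON at the site root via `html_extra_path` keeps the relative `json_url` resolvable from the deployed pages, not just from the static source tree.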
509 changes: 184 additions & 325 deletions docs/source/performance/perf-overview.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/source/quick-start-guide.md
@@ -15,7 +15,7 @@ Here is a simple example to show how to use the LLM API with TinyLlama.
```

You can also directly load TensorRT Model Optimizer's [quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) in the LLM constructor.
- To learn more about the LLM API, check out the [](llm-api/index) and [](llm-api-examples/index).
+ To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).
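
As a minimal, hedged sketch of that path (the FP8 checkpoint id below is an assumption; substitute any model from the linked collection):

```python
from tensorrt_llm import LLM, SamplingParams

# Assumed checkpoint id from the Model Optimizer collection; quantized
# weights load through the same LLM constructor as unquantized models.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

for output in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
    print(output.outputs[0].text)
```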

(deploy-with-trtllm-serve)=
## Deploy with trtllm-serve
@@ -151,7 +151,7 @@ In this Quick Start Guide, you:

For more examples, refer to:

- - [examples/](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for showcases of how to run a quick benchmark on latest LLMs.
+ - [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for showcases of how to run a quick benchmark on latest LLMs.

## Related Information

29 changes: 28 additions & 1 deletion docs/source/reference/support-matrix.md
@@ -4,7 +4,34 @@

TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.

- ## Models
+ ## Models (PyTorch Backend)

| Architecture | Model | HuggingFace Example | Modality |
|--------------|-------|---------------------|----------|
| `BertForSequenceClassification` | BERT-based | `textattack/bert-base-uncased-yelp-polarity` | L |
| `DeciLMForCausalLM` | Nemotron | `nvidia/Llama-3_1-Nemotron-51B-Instruct` | L |
| `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3` | L |
| `LlavaLlamaModel` | VILA | `Efficient-Large-Model/NVILA-8B` | L + V |
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT | `llava-hf/llava-v1.6-mistral-7b-hf` | L + V |
| `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA | `meta-llama/Meta-Llama-3.1-70B` | L |
| `Llama4ForConditionalGeneration` | Llama 4 | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | L |
| `MistralForCausalLM` | Mistral | `mistralai/Mistral-7B-v0.1` | L |
| `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` | L |
| `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` | L |
| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` | L |
| `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | L |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` | L |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B` | L |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` | L |
| `Qwen2VLForConditionalGeneration` | Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |

Note:
- L: Language only
- L + V: Language and Vision multimodal support
- Llama 3.2 accepts vision input, but our support is currently limited to text only.
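
As a hedged illustration of how a table entry is exercised, the sketch below uses the `tensorrt_llm._torch` entry point described in `docs/source/torch.md`; treat the import path and defaults as assumptions rather than a stable interface.

```python
# Sketch (assumed entry point): run one of the "L" table entries on the
# PyTorch backend with default sampling.
from tensorrt_llm._torch import LLM

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # any language-only row above
print(llm.generate(["What is TensorRT-LLM?"])[0].outputs[0].text)
```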

## Models (TensorRT Backend)

### LLM Models

123 changes: 122 additions & 1 deletion docs/source/release-notes.md
@@ -5,6 +5,127 @@
All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).


## TensorRT-LLM Release 0.19.0

### Key Features and Enhancements
- **The C++ runtime is now open sourced.**
- **PyTorch workflow**
  - Added DeepSeek V3/R1 support. Refer to `examples/deepseek_v3/README.md` and the blog `docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md`.
  - Added Llava-Next support.
  - Added BERT support.
  - Added a C++ based decoder, which added support for:
    - TopK / TopP.
    - Bad words.
    - Stop words.
    - Embedding bias.
  - Added an Autotuner for custom-op-compatible tuning processes:
    - Added a Python-based Autotuner core framework for kernel tuning.
    - Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
  - Added guided decoding support (XGrammar integration); a usage sketch follows the Key Features list below.
  - Added pipeline parallelism support for the overlap scheduler in `PyExecutor`.
  - Added Qwen2VL model support.
  - Added mixed precision quantization support.
  - Added pipeline parallelism with attention DP support.
  - Added no-cache attention support.
  - Added `PeftCacheManager` support.
  - Added Qwen2.5-VL support and refactored Qwen2-VL.
  - Added trtllm-gen FP4 GEMM support.
  - Added Qwen2 MoE support.
  - Applied `AutoTuner` to both Fused MoE and NVFP4 Linear operators.
  - Introduced a `UserBuffers` allocator.
  - Added Deepseek eager mode AllReduce fusion support.
  - Added Multi-Token Prediction (MTP) support. Refer to the “Multi-Token Prediction (MTP)” section of `examples/deepseek_v3/README.md`.
  - Added FlashMLA support for SM90.
  - Added support for enabling MTP with CUDA graph padding.
  - Added initial EAGLE-3 implementation.
  - Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
- **AutoDeploy for PyTorch workflow**
  - The AutoDeploy for PyTorch workflow is an **experimental** feature in `tensorrt_llm._torch.auto_deploy`.
  - AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
  - Check out `examples/auto_deploy/README.md` for more details.
- **LLM API**
  - [BREAKING CHANGE] Added dynamic logits processor support and deprecated the static logits processor.
  - Added batched logits processor support.
  - Added EAGLE support.
  - Added abort request support.
  - Added `get_stats` support.
  - Added multi-node support for Slurm-based clusters; refer to `examples/llm-api/llm_mgmn_*.sh`.
- Added InternLM-XComposer2 support. Refer to the “InternLM-XComposer2” section in `examples/multimodal/README.md`.
- Added INT4-AWQ support for MoE models. Refer to the “AWQ Quantization” section in `examples/mixtral/README.md`.
- Added Qwen2-Audio support. Refer to `examples/qwen2audio/README.md`.
- Added Language-Adapter support. Refer to `examples/language_adapter/README.md`.
- Added STDiT for OpenSoRA text-to-video support. Refer to `examples/stdit/README.md`.
- Added vision encoders with tensor parallelism and context parallelism support. Refer to `examples/vit/README.md`.
- Added EXAONE-Deep support. Refer to `examples/exaone/README.md`.
- Added support for Phi-4-mini and Phi-4-MM.
- Added Gemma3 text-only model support. Refer to the “Run Gemma 3” section in `examples/gemma/README.md`.
- Added FP8 quantization support for Qwen2-VL.
- Added batched inference support for the LLM API MMLU example `examples/mmlu_llmapi.py`.
- Added FP4 quantization-layernorm fusion plugin support (Llama models only).
- Added Mamba-Hybrid support.
- Added NVILA video support, including 1-prompt/N-media and N-prompt/N-media batching modes.
- Added a `--quantize_lm_head` option to `examples/quantization/quantize.py` to support `lm_head` quantization.
- Added batched tensor FP4 quantization support.
- Added a `/metrics` endpoint for `trtllm-serve` to log iteration statistics.
- Added LoRA support for the Phi-2 model.
- Added returning context logits support for `trtllm-serve`.
- Added a one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
- Added request BW metric measurement for `disaggServerBenchmark`.
- Updated the logits bitmask kernel to v3.
- Enabled CUDA graphs when attention DP was used and active requests on different GPUs were uneven.
- Added iteration log support for `trtllm-bench`.
- `fp8_blockscale_gemm` is now open-sourced.
- Added AWQ support for ModelOpt checkpoints.
- Added Linear block scale layout support in FP4 quantization.
- Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
- Added Variable-Beam-Width-Search (VBWS) support (part 2).
- Added LoRA support for Gemma.
- Refactored the scaffolding worker and added OpenAI API worker support.
- Optionally split MoE inputs into chunks to reduce GPU memory usage.
- Added UCX IP interface support.
- [BREAKING CHANGE] Added output of the first token to additional generation outputs.
- Added FP8 support for the SM120 architecture.
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options.
- Made the scaffolding Controller more generic.
- [BREAKING CHANGE] Added individual gatherContext support for each additional output.
- Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager`.
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging.
- Supported aborting disconnected requests.
- Added an option to run disaggregated serving without context servers.
- Fixed and improved allreduce and fusion kernels.
- Enhanced the integrated robustness of scaffolding via `__init__.py`.
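
A minimal, hedged sketch of the guided decoding feature called out above: the `guided_decoding_backend` keyword and the `GuidedDecodingParams` import path are assumptions about this release's LLM API, so verify them against the shipped `SamplingParams` before relying on them.

```python
# Sketch (assumed parameter names): XGrammar-based guided decoding.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams  # assumed path

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          guided_decoding_backend="xgrammar")  # assumed keyword
params = SamplingParams(
    max_tokens=64,
    # Constrain generation to any JSON object via a JSON-schema string.
    guided_decoding=GuidedDecodingParams(json='{"type": "object"}'),
)
print(llm.generate(["Reply with a JSON object."], params)[0].outputs[0].text)
```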

### API Changes
- Exposed `kv_cache_retention_config` from the C++ `executor` API to the LLM API.
- Moved `BuildConfig` arguments to `LlmArgs` (see the sketch after this list).
- Removed speculative decoding parameters from stateful decoders.
- Exposed `DecoderState` via bindings and integrated it in the decoder.
- Refactored `LlmArgs` with `Pydantic` and migrated the remaining pybinding configurations to Python.
- Refactored the disaggregated serving scripts.
- Added `numNodes` to `ParallelConfig`.
- Redesigned the multi-stream API for DeepSeek.
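
For the `BuildConfig`-to-`LlmArgs` move above, a hedged sketch of what the flattened arguments plausibly look like; the field names are carried over from `BuildConfig` as assumptions, not a verified interface.

```python
# Sketch (assumed field names): BuildConfig knobs accepted directly by
# the LLM constructor now that they live on LlmArgs.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_batch_size=8,     # formerly BuildConfig.max_batch_size
    max_num_tokens=2048,  # formerly BuildConfig.max_num_tokens
)
```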

### Fixed Issues
- Fixed misused length argument of PluginField. Thanks to the contribution from @jl749 in #2712. This also fixes #2685.
- Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
- Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
- Fixed incorrect batch slot usage in `addCumLogProbs` kernel. Thanks to the contribution from @aotman in #2787.
- Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
- Removed the need for `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm`.

### Infrastructure Changes
- The dependent NVIDIA ModelOpt version is updated to 0.27.

### Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.


## TensorRT-LLM Release 0.18.2

### Key Features and Enhancements
- This update addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit https://www.nvidia.com/en-us/security/.


## TensorRT-LLM Release 0.18.1

### Key Features and Enhancements
@@ -65,7 +186,7 @@ All published functionality in the Release Notes has been fully tested and verif
### Known Issues
- Need `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm` due to new third-party dependencies.
- The PYPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related [GitHub issue](https://github.com/pytorch/pytorch/issues/144966).
- - The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
+ - The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.

### Fixed Issues
- Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in #2484.
2 changes: 1 addition & 1 deletion docs/source/torch.md
@@ -41,7 +41,7 @@ scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --ex

- [Architecture Overview](./torch/arch_overview.md)
- [Adding a New Model](./torch/adding_new_model.md)
- - [Examples](../../examples/pytorch/README.md)
+ - [Examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/pytorch/README.md)

## Key Components

2 changes: 1 addition & 1 deletion examples/auto_deploy/README.md
@@ -270,7 +270,7 @@ llm = LLM(

</details>

- For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/).
+ For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).

______________________________________________________________________

2 changes: 1 addition & 1 deletion examples/llm-api/README.md
@@ -1,3 +1,3 @@
# LLM API Examples

- Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+ Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.