Merged

Changes from all commits (22 commits):
- `6e92c90` fix: Fix/fused moe 0.19 (#3799) (byshiue, Apr 24, 2025)
- `2554823` fix: Add pre-download of checkpoint before benchmark. (#3772) (FrankD412, Apr 25, 2025)
- `89ea617` [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding wi… (syuoni, Apr 25, 2025)
- `377cee8` TRTLLM-4875 feat: Add version switcher to doc (#3871) (kaiyux, Apr 25, 2025)
- `474b964` waive a test (#3897) (Superjomn, Apr 28, 2025)
- `8284c1f` docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939) (nv-guomingz, Apr 29, 2025)
- `9d31e8f` fix: remote mpi session abort (#3884) (Superjomn, Apr 29, 2025)
- `302ce1f` skip fp8 gemm for pre-hopper (#3931) (crazydemo, Apr 30, 2025)
- `b544308` [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with ove… (syuoni, May 1, 2025)
- `10d2231` Doc: Fix H200 DeepSeek R1 perf doc (#4006) (jiahanc, May 2, 2025)
- `85162d7` Fix the perf regression caused by insufficient cache warmup. (#4042) (hyukn, May 3, 2025)
- `6b47a4f` doc: Update 0.19.0 release notes (#3976) (kaiyux, May 6, 2025)
- `4aa4098` Optimize the AutoTuner cache access code to reduce host code overhead… (hyukn, May 6, 2025)
- `dd1bd3c` Update switcher (#4098) (kaiyux, May 6, 2025)
- `710535f` doc: update release notes (#4108) (kaiyux, May 7, 2025)
- `4afc1ce` docs:update 0.19 doc. (#4120) (nv-guomingz, May 7, 2025)
- `a478649` docs:add torch flow supported model list. (#4129) (nv-guomingz, May 8, 2025)
- `e3aff8b` doc: Release V0.19 Perf Overview Update (#4166) (zbpatel, May 9, 2025)
- `2d966da` Fix readme of autodeploy. (dcampora, May 13, 2025)
- `cbdcbc3` Update tensorrt_llm/_torch/pyexecutor/llm_request.py (dcampora, May 14, 2025)
- `db6f3b3` Revert mgmn worker node. (dcampora, May 14, 2025)
- `5d3fa62` Change to disable_overlap_scheduler. (dcampora, May 16, 2025)
16 changes: 16 additions & 0 deletions docs/source/_static/switcher.json
@@ -3,5 +3,21 @@
"preferred": true,
"version": "latest",
"url": "https://nvidia.github.io/TensorRT-LLM/latest"
},
{
"version": "0.19.0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.19.0"
},
{
"version": "0.20.0rc0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.20.0rc0"
},
{
"version": "0.19.0rc0",
"url": "https://nvidia.github.io/TensorRT-LLM/0.19.0rc0"
},
{
"version": "0.18.2",
"url": "https://nvidia.github.io/TensorRT-LLM/0.18.2"
}
]
6 changes: 3 additions & 3 deletions docs/source/architecture/core-concepts.md
@@ -168,16 +168,16 @@ As a result, even if TensorRT has a powerful pattern-matching algorithm and
supports a lot of possible fusions, there is always the risk that it cannot
identify uncommon and/or very advanced patterns. To overcome that inevitable
limitation, TensorRT offers a powerful mechanism known as
- [plugins](https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Plugin/pyPlugin.html).
+ [plugins](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/python-api/infer/Plugin/pyPlugin.html).

The plugins are nodes inserted in the network graph definition that map to user-defined
GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in
the [`cpp/tensorrt_llm/plugins`](source:/cpp/tensorrt_llm/plugins) directory.

Plugins are written in C++ and follow a well-defined interface described in the
- [Extending TensorRT with Custom Layers](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#extending)
+ [Extending TensorRT with Custom Layers](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/extending-custom-layers.html)
section of the TensorRT
- [Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html).
+ [Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/latest/index.html).
When executed within a TensorRT engine, plugins trigger the execution of
their encapsulated GPU kernels. A fairly simple example of plugins is the
[`QuantizeTensorPlugin`](source:/cpp/tensorrt_llm/plugins/quantizeTensorPlugin) that
@@ -325,7 +325,7 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 \
--dataset $YOUR_DATA_PATH \
--backend pytorch \
--max_batch_size 128 \
- --max_num_tokens 1127 \
+ --max_num_tokens 1151 \
--num_requests 5120 \
--concurrency 1024 \
--kv_cache_free_gpu_mem_fraction 0.8 \
@@ -339,13 +339,13 @@ The perf might be different from different datasets and machines
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
- Request Throughput (req/sec): 5.1532
- Total Output Throughput (tokens/sec): 10553.8445
- Per User Output Throughput (tokens/sec/user): 10.4199
- Per GPU Output Throughput (tokens/sec/gpu): 1319.2306
- Total Token Throughput (tokens/sec): 15707.0888
- Total Latency (ms): 993548.8470
- Average request latency (ms): 197768.0434
+ Request Throughput (req/sec): 5.6100
+ Total Output Throughput (tokens/sec): 11489.2671
+ Per User Output Throughput (tokens/sec/user): 11.3476
+ Per GPU Output Throughput (tokens/sec/gpu): 1436.1584
+ Total Token Throughput (tokens/sec): 17233.9007
+ Total Latency (ms): 912656.9938
+ Average request latency (ms): 181540.5739
```
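
As a quick sanity check, the updated summary metrics are internally consistent. The short sketch below is a hedged reconstruction: it assumes the 8-GPU DeepSeek-R1 run and the 1024/2048 ISL/OSL workload used in this guide, and re-derives several rows of the table from the request throughput alone.

```python
# Sanity-check the relationships between the trtllm-bench summary metrics.
# Assumptions: 8 GPUs; ISL/OSL of 1024/2048; figures copied from the table above.
num_gpus = 8
num_requests = 5120
req_tput = 5.6100            # Request Throughput (req/sec)
total_out_tput = 11489.2671  # Total Output Throughput (tokens/sec)
total_tok_tput = 17233.9007  # Total Token Throughput (tokens/sec)

print(total_out_tput / num_gpus)                     # ~1436.16 tokens/sec/gpu (Per GPU Output Throughput)
print(total_out_tput / req_tput)                     # ~2048 tokens/request (the OSL)
print((total_tok_tput - total_out_tput) / req_tput)  # ~1024 tokens/request (the ISL)
print(num_requests / req_tput * 1000)                # ~912656 ms (Total Latency)
```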

## Exploring more ISL/OSL combinations
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -87,6 +87,7 @@

html_theme = 'nvidia_sphinx_theme'
html_static_path = ['_static']
html_extra_path = ["./_static/switcher.json"]
html_theme_options = {
"switcher": {
"json_url": "./_static/switcher.json",
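
The `conf.py` hunk above is collapsed below `json_url`, so here is a minimal sketch of how the version switcher is presumably wired end to end. The `version_match` line and the use of Sphinx's standard `release` variable are assumptions; they fall outside the visible diff.

```python
# Sketch of the assumed version-switcher wiring in docs/source/conf.py.
# Only the html_extra_path line is added by this PR; the rest is
# reconstructed context, not a verbatim copy of the file.
html_theme = 'nvidia_sphinx_theme'
html_static_path = ['_static']
html_extra_path = ["./_static/switcher.json"]  # publishes switcher.json at the site root
html_theme_options = {
    "switcher": {
        "json_url": "./_static/switcher.json",
        # Assumed: ties this build to one "version" entry in switcher.json.
        "version_match": release,
    },
    # ...remaining theme options are collapsed in the diff...
}
```

Publishing the JSON at the site root via `html_extra_path` keeps the relative `json_url` resolvable from the deployed pages, not just from the static source tree.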
509 changes: 184 additions & 325 deletions docs/source/performance/perf-overview.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/source/quick-start-guide.md
@@ -15,7 +15,7 @@ Here is a simple example to show how to use the LLM API with TinyLlama.
```

You can also directly load TensorRT Model Optimizer's [quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) in the LLM constructor.
- To learn more about the LLM API, check out the [](llm-api/index) and [](llm-api-examples/index).
+ To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).
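
As a minimal, hedged sketch of that path (the FP8 checkpoint id below is an assumption; substitute any model from the linked collection):

```python
from tensorrt_llm import LLM, SamplingParams

# Assumed checkpoint id from the Model Optimizer collection; quantized
# weights load through the same LLM constructor as unquantized models.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

for output in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
    print(output.outputs[0].text)
```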

(deploy-with-trtllm-serve)=
## Deploy with trtllm-serve
@@ -151,7 +151,7 @@ In this Quick Start Guide, you:

For more examples, refer to:

- - [examples/](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for showcases of how to run a quick benchmark on latest LLMs.
+ - [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for showcases of how to run a quick benchmark on latest LLMs.

## Related Information

29 changes: 28 additions & 1 deletion docs/source/reference/support-matrix.md
@@ -4,7 +4,34 @@

TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA GPUs. The following sections provide a list of supported GPU architectures as well as important features implemented in TensorRT-LLM.

- ## Models
+ ## Models (PyTorch Backend)

| Architecture | Model | HuggingFace Example | Modality |
|--------------|-------|---------------------|----------|
| `BertForSequenceClassification` | BERT-based | `textattack/bert-base-uncased-yelp-polarity` | L |
| `DeciLMForCausalLM` | Nemotron | `nvidia/Llama-3_1-Nemotron-51B-Instruct` | L |
| `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3` | L |
| `LlavaLlamaModel` | VILA | `Efficient-Large-Model/NVILA-8B` | L + V |
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT | `llava-hf/llava-v1.6-mistral-7b-hf` | L + V |
| `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA | `meta-llama/Meta-Llama-3.1-70B` | L |
| `Llama4ForConditionalGeneration` | Llama 4 | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | L |
| `MistralForCausalLM` | Mistral | `mistralai/Mistral-7B-v0.1` | L |
| `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` | L |
| `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` | L |
| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` | L |
| `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | L |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` | L |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B` | L |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` | L |
| `Qwen2VLForConditionalGeneration` | Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |

Note:
- L: Language only
- L + V: Language and Vision multimodal support
- Llama 3.2 accepts vision input, but our support is currently limited to text only.
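
As a hedged illustration of how a table entry is exercised, the sketch below uses the `tensorrt_llm._torch` entry point described in `docs/source/torch.md`; treat the import path and defaults as assumptions rather than a stable interface.

```python
# Sketch (assumed entry point): run one of the "L" table entries on the
# PyTorch backend with default sampling.
from tensorrt_llm._torch import LLM

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # any language-only row above
print(llm.generate(["What is TensorRT-LLM?"])[0].outputs[0].text)
```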

## Models (TensorRT Backend)

### LLM Models

123 changes: 122 additions & 1 deletion docs/source/release-notes.md
@@ -5,6 +5,127 @@
All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).


## TensorRT-LLM Release 0.19.0

### Key Features and Enhancements
- **The C++ runtime is now open sourced.**
- **PyTorch workflow**
  - Added DeepSeek V3/R1 support. Refer to `examples/deepseek_v3/README.md` and the blog `docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md`.
  - Added Llava-Next support.
  - Added BERT support.
  - Added a C++ based decoder, which added support for:
    - TopK / TopP.
    - Bad words.
    - Stop words.
    - Embedding bias.
  - Added an Autotuner for custom-op-compatible tuning processes:
    - Added a Python-based Autotuner core framework for kernel tuning.
    - Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
  - Added guided decoding support (XGrammar integration); a usage sketch follows the Key Features list below.
  - Added pipeline parallelism support for the overlap scheduler in `PyExecutor`.
  - Added Qwen2VL model support.
  - Added mixed precision quantization support.
  - Added pipeline parallelism with attention DP support.
  - Added no-cache attention support.
  - Added `PeftCacheManager` support.
  - Added Qwen2.5-VL support and refactored Qwen2-VL.
  - Added trtllm-gen FP4 GEMM support.
  - Added Qwen2 MoE support.
  - Applied `AutoTuner` to both Fused MoE and NVFP4 Linear operators.
  - Introduced a `UserBuffers` allocator.
  - Added Deepseek eager mode AllReduce fusion support.
  - Added Multi-Token Prediction (MTP) support. Refer to the “Multi-Token Prediction (MTP)” section of `examples/deepseek_v3/README.md`.
  - Added FlashMLA support for SM90.
  - Added support for enabling MTP with CUDA graph padding.
  - Added initial EAGLE-3 implementation.
  - Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
- **AutoDeploy for PyTorch workflow**
  - The AutoDeploy for PyTorch workflow is an **experimental** feature in `tensorrt_llm._torch.auto_deploy`.
  - AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
  - Check out `examples/auto_deploy/README.md` for more details.
- **LLM API**
  - [BREAKING CHANGE] Added dynamic logits processor support and deprecated the static logits processor.
  - Added batched logits processor support.
  - Added EAGLE support.
  - Added abort request support.
  - Added `get_stats` support.
  - Added multi-node support for Slurm-based clusters; refer to `examples/llm-api/llm_mgmn_*.sh`.
- Added InternLM-XComposer2 support. Refer to the “InternLM-XComposer2” section in `examples/multimodal/README.md`.
- Added INT4-AWQ support for MoE models. Refer to the “AWQ Quantization” section in `examples/mixtral/README.md`.
- Added Qwen2-Audio support. Refer to `examples/qwen2audio/README.md`.
- Added Language-Adapter support. Refer to `examples/language_adapter/README.md`.
- Added STDiT for OpenSoRA text-to-video support. Refer to `examples/stdit/README.md`.
- Added vision encoders with tensor parallelism and context parallelism support. Refer to `examples/vit/README.md`.
- Added EXAONE-Deep support. Refer to `examples/exaone/README.md`.
- Added support for Phi-4-mini and Phi-4-MM.
- Added Gemma3 text-only model support. Refer to the “Run Gemma 3” section in `examples/gemma/README.md`.
- Added FP8 quantization support for Qwen2-VL.
- Added batched inference support for the LLM API MMLU example `examples/mmlu_llmapi.py`.
- Added FP4 quantization-layernorm fusion plugin support (Llama models only).
- Added Mamba-Hybrid support.
- Added NVILA video support, including 1-prompt/N-media and N-prompt/N-media batching modes.
- Added a `--quantize_lm_head` option to `examples/quantization/quantize.py` to support `lm_head` quantization.
- Added batched tensor FP4 quantization support.
- Added a `/metrics` endpoint for `trtllm-serve` to log iteration statistics.
- Added LoRA support for the Phi-2 model.
- Added returning context logits support for `trtllm-serve`.
- Added a one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
- Added request BW metric measurement for `disaggServerBenchmark`.
- Updated the logits bitmask kernel to v3.
- Enabled CUDA graphs when attention DP was used and active requests on different GPUs were uneven.
- Added iteration log support for `trtllm-bench`.
- `fp8_blockscale_gemm` is now open-sourced.
- Added AWQ support for ModelOpt checkpoints.
- Added Linear block scale layout support in FP4 quantization.
- Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
- Added Variable-Beam-Width-Search (VBWS) support (part 2).
- Added LoRA support for Gemma.
- Refactored the scaffolding worker and added OpenAI API worker support.
- Optionally split MoE inputs into chunks to reduce GPU memory usage.
- Added UCX IP interface support.
- [BREAKING CHANGE] Added output of the first token to additional generation outputs.
- Added FP8 support for the SM120 architecture.
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options.
- Made the scaffolding Controller more generic.
- [BREAKING CHANGE] Added individual gatherContext support for each additional output.
- Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager`.
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging.
- Supported aborting disconnected requests.
- Added an option to run disaggregated serving without context servers.
- Fixed and improved allreduce and fusion kernels.
- Enhanced the integrated robustness of scaffolding via `__init__.py`.
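
A minimal, hedged sketch of the guided decoding feature called out above: the `guided_decoding_backend` keyword and the `GuidedDecodingParams` import path are assumptions about this release's LLM API, so verify them against the shipped `SamplingParams` before relying on them.

```python
# Sketch (assumed parameter names): XGrammar-based guided decoding.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams  # assumed path

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          guided_decoding_backend="xgrammar")  # assumed keyword
params = SamplingParams(
    max_tokens=64,
    # Constrain generation to any JSON object via a JSON-schema string.
    guided_decoding=GuidedDecodingParams(json='{"type": "object"}'),
)
print(llm.generate(["Reply with a JSON object."], params)[0].outputs[0].text)
```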

### API Changes
- Exposed `kv_cache_retention_config` from the C++ `executor` API to the LLM API.
- Moved `BuildConfig` arguments to `LlmArgs` (see the sketch after this list).
- Removed speculative decoding parameters from stateful decoders.
- Exposed `DecoderState` via bindings and integrated it in the decoder.
- Refactored `LlmArgs` with `Pydantic` and migrated the remaining pybinding configurations to Python.
- Refactored the disaggregated serving scripts.
- Added `numNodes` to `ParallelConfig`.
- Redesigned the multi-stream API for DeepSeek.
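
For the `BuildConfig`-to-`LlmArgs` move above, a hedged sketch of what the flattened arguments plausibly look like; the field names are carried over from `BuildConfig` as assumptions, not a verified interface.

```python
# Sketch (assumed field names): BuildConfig knobs accepted directly by
# the LLM constructor now that they live on LlmArgs.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_batch_size=8,     # formerly BuildConfig.max_batch_size
    max_num_tokens=2048,  # formerly BuildConfig.max_num_tokens
)
```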

### Fixed Issues
- Fixed misused length argument of PluginField. Thanks to the contribution from @jl749 in #2712. This also fixes #2685.
- Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
- Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
- Fixed incorrect batch slot usage in `addCumLogProbs` kernel. Thanks to the contribution from @aotman in #2787.
- Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
- Removed the need for `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm`.

### Infrastructure Changes
- The dependent NVIDIA ModelOpt version is updated to 0.27.

### Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.


## TensorRT-LLM Release 0.18.2

### Key Features and Enhancements
- This update addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit https://www.nvidia.com/en-us/security/.


## TensorRT-LLM Release 0.18.1

### Key Features and Enhancements
@@ -65,7 +186,7 @@ All published functionality in the Release Notes has been fully tested and verif
### Known Issues
- Need `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm` due to new third-party dependencies.
- The PYPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related [GitHub issue](https://github.com/pytorch/pytorch/issues/144966).
- - The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
+ - The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.

### Fixed Issues
- Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in #2484.
2 changes: 1 addition & 1 deletion docs/source/torch.md
@@ -41,7 +41,7 @@ scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --ex

- [Architecture Overview](./torch/arch_overview.md)
- [Adding a New Model](./torch/adding_new_model.md)
- - [Examples](../../examples/pytorch/README.md)
+ - [Examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/pytorch/README.md)

## Key Components

2 changes: 1 addition & 1 deletion examples/auto_deploy/README.md
@@ -270,7 +270,7 @@ llm = LLM(

</details>

- For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/).
+ For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).

______________________________________________________________________

2 changes: 1 addition & 1 deletion examples/llm-api/README.md
@@ -1,3 +1,3 @@
# LLM API Examples

- Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+ Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.