
Releases: NVIDIA/TensorRT-LLM

v1.1.0rc5 (Pre-release)

18 Sep 01:49 · 0c9430e

Announcement Highlights:

  • Model Support
    • Enable NvFP4/FP8 quantization for Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349)
    • Support gpt-oss with fp8 kv cache (#7612)
    • Support kvcache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector api (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for deepseek-r1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Top-k logprobs for the TRT backend and top-1 logprob for the PyTorch backend (#6097) (see the sketch after this list)
    • Support chunked prefill for multimodal models (#6843)
    • Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add deepseek r1-w4afp8 quickstart (#7645)
    • Nanobind: Allow none types for fields in result (#7672)
    • Use arrival time from the LLM API when creating LlmRequest in the PyTorch workflow (#7553)
    • Support IPv6 for UCX/ZMQ IP addresses (#7530)
    • Refactor: Quantization Transforms with Inheritance (#7227)
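
As a quick illustration of the top-k logprobs item above, here is a minimal LLM API sketch. It assumes SamplingParams exposes a logprobs field and that each completion carries a logprobs attribute, consistent with the PR title but not verified against this exact build; the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any supported model works here (assumption).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Per the note above, the TRT backend returns top-k logprobs while the
# PyTorch backend returns only the top-1 logprob, so request a single
# logprob to stay portable across both backends (assumed field name).
params = SamplingParams(max_tokens=32, temperature=0.0, logprobs=1)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token logprob entries, if populated
```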

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix]UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependancies by @v-shobhit in #7383
  • [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

Full Changelog: v1.1.0rc4...v1.1.0rc5

v1.1.0rc4 (Pre-release)

10 Sep 07:32 · 62b564a

Announcement Highlights:

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionalities for Responses API (#7341)
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492)
  • Feature
    • Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948) (see the sketch after this list)
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • Add gpt-oss 20G tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • Fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • Add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation (#6696)
    • Update kvcache part (#7549)
    • Rename TensorRT-LLM to TensorRT LLM (#7554)
    • Refine docs for accuracy evaluation of gpt-oss models (#7252)
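
The guided-decoding item above (now usable together with the one-model speculative engine) can be exercised through the LLM API roughly as follows. This is a sketch under the assumption that GuidedDecodingParams is importable from tensorrt_llm.llmapi, accepts a JSON schema via a json field, and that guided_decoding_backend="xgrammar" is a valid LLM argument; the model name and schema are placeholders.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams  # assumed import path

# Assumed: "xgrammar" is an accepted guided-decoding backend name here.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", guided_decoding_backend="xgrammar")

# Constrain the output to a small JSON schema (placeholder schema).
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),  # assumed argument/field names
)

print(llm.generate(["Describe Paris as JSON."], params)[0].outputs[0].text)
```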

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
  • …

v1.1.0rc2.post2 (Pre-release)

15 Sep 05:11 · ef0d06d

Announcement Highlights:

  • Feature
    • Add MNNVL AlltoAll tests to pre-merge (#7465)
    • Support multi-threaded tokenizers for trtllm-serve (#7515)
    • FP8 Context MLA integration (#7581)
    • Support block wise FP8 in wide ep (#7423)
    • Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
    • Make low_precision_combine an LLM arg (#7598)
  • Documentation
    • Update deployment guide and cherry-pick CI test fix from main (#7623)

What's Changed

  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
  • [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
  • [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
  • [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
  • [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
  • [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
  • [None][chore] Make use_low_precision_moe_combine as a llm arg by @zongfeijing in #7598
  • [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
  • [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
  • [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524

Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2

v1.1.0rc2.post1 (Pre-release)

06 Sep 00:06 · 9d6e87a

Announcement Highlights:

  • API
    • Update TargetInfo to accommodate CP in disagg (#7224)
  • Benchmark
    • Minor fixes to slurm and benchmark scripts (#7453)
  • Feature
    • Support DeepGEMM swap-AB on sm100 (#7355)
    • Merge add sparse exp and shared exp into local reduction (#7422)
    • Add batch waiting when scheduling (#7287)
    • Reuse pytorch memory segments occupied by cudagraph pool (#7457)
    • Complete the last missing allreduce op in Llama3/4 (#7420)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Store blog 10 media via LFS (#7375)

What's Changed

  • [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
  • [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
  • [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
  • [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
  • [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
  • [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
  • [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
  • [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
  • [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
  • [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
  • [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
  • [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
  • [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
  • [None][fix] Update DG side branch name by @Barry-Delaney in #7491
  • [None][fix] Update DG commit by @Barry-Delaney in #7534
  • [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
  • [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
  • [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7287
  • [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
  • [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487

Full Changelog: v1.1.0rc2...v1.1.0rc2.post1

v1.1.0rc3 (Pre-release)

04 Sep 08:24 · e81c50d

Announcement Highlights:

  • Model Support
    • Add fp8 support for Mistral Small 3.1 (#6731)
  • Benchmark
    • Add benchmark TRT flow test for MIG (#6884)
    • Mistral Small 3.1 accuracy tests (#6909)
  • Feature
    • Update TargetInfo to accommodate CP in disagg (#7224)
    • Merge add sparse exp and shared exp into local reduction (#7369)
    • Support NVFP4 KV Cache (#6244)
    • Allocate MoE workspace only when necessary (release/1.0 retargeted) (#6955)
    • Implement capturable drafting loops for speculation (#7100)
    • Revert phi4-mm aggregate mode (#6907)
    • Complete the last missing allreduce op in Llama3/4. (#6850)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Add docs for Gemma3 VLMs (#6880)
    • Add legacy section for TensorRT engine (#6724)
    • Update DeepSeek example doc (#7358)

Full Changelog: v1.1.0rc2...v1.1.0rc3

v1.1.0rc2 (Pre-release)

31 Aug 02:22 · 15ec2b8

Announcement Highlights:

  • Model Support

    • Refactor llama4 for multimodal encoder IFB (#6844)
  • API

    • Add standalone multimodal encoder (#6743)
    • Enable Cross-Attention to use XQA kernels for Whisper (#7035)
    • Enable nanobind as the default binding library (#6608)
    • trtllm-serve + autodeploy integration (#7141)
    • Chat completions API for gpt-oss (#7261) (client example after this list)
    • KV Cache Connector API (#7228)
    • Create PyExecutor from TorchLlmArgs Part 1 (#7105)
    • TP Sharding read from the model config (#6972)
  • Benchmark

    • Add llama4 tp4 tests (#6989)
    • Add test_multi_nodes_eval tests (#7108)
    • Nsys profile output kernel classifier (#7020)
    • Add kv cache size in bench metric and fix failed cases (#7160)
    • Add perf metrics endpoint to openai server and openai disagg server (#6985)
    • Add gpt-oss tests to sanity list (#7158)
    • Add L20-specific QA test list (#7067)
    • Add beam search CudaGraph + Overlap Scheduler tests (#7326)
    • Update qwen3 timeout to 60 minutes (#7200)
    • Update maxnt of llama_v3.2_1b bench (#7279)
    • Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
    • Accelerate global scale calculations for deepEP fp4 combine (#7126)
    • Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
    • Balance the request based on number of tokens in AttentionDP (#7183)
    • Wrap the swiglu into custom op to avoid redundant device copy (#7021)
  • Feature

    • Add QWQ-32b torch test (#7284)
    • Fix llama4 multimodal by skipping request validation (#6957)
    • Add group attention pattern for solar-pro-preview (#7054)
    • Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
    • Update lora for phi4-mm (#6817)
    • Refactor the CUDA graph runner to manage all CUDA graphs (#6846)
    • Enable chunked prefill for Nemotron-H (#6334)
    • Add customized default routing method (#6818)
    • Testing cache transmission functionality in Python (#7025)
    • Simplify decoder state initialization for speculative decoding (#6869)
    • Support MMMU for multimodal models (#6828)
    • Deepseek: Start Eagle work (#6210)
    • Optimize and refactor alltoall in WideEP (#6973)
    • Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
    • Hopper FP8 context MLA (#7116)
    • Padding for piecewise cudagraph (#6750)
    • Add low precision all2all for mnnvl (#7155)
    • Use NUMA to bind CPU (#7304)
    • Skip prefetching consolidated safetensors when appropriate (#7013)
    • Unify sampler handle logits implementation (#6867)
    • Move fusion, kvcache, and compile to modular inference optimizer (#7057)
    • Make finalize fusion part of the tactic selection logic (#6915)
    • Fuse slicing into MoE (#6728)
    • Add logging for OAI disagg server (#7232)
  • Documentation

    • Update gpt-oss deployment guide to latest release image (#7101)
    • Update stale link for AutoDeploy (#7135)
    • Add GPT-OSS Deployment Guide into official doc site (#7143)
    • Refine GPT-OSS doc (#7180)
    • Update feature_combination_matrix doc (#6691)
    • Update disagg doc about UCX_MAX_RNDV_RAILS (#7205)
    • Display tech blog for nvidia.github.io domain (#7241)
    • Updated blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
    • Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
    • Add ADP balance blog (#7213)
    • Fix doc formula (#7367)
    • Update disagg readme and scripts for pipeline parallelism (#6875)
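
For the gpt-oss chat completions support called out above, a running trtllm-serve endpoint can be queried with the standard OpenAI client. A minimal sketch, assuming a local server was started for a gpt-oss checkpoint on the default port 8000; the model name, port, and prompt are placeholders, not pinned by this changelog.

```python
from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible endpoint; URL and port are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # must match the model the server was launched with
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me one fun fact about GPUs."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```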

What's Changed

  • [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
  • [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
  • [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
  • [None][fix] fix llmapi import error by @crazydemo in #7030
  • [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
  • [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
  • [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
  • [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
  • [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
  • [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
  • [None][chore] Update namelist in blossom-ci by @karljang in #7015
  • [None][ci] move unittests to sub-directories by @Funatiq in #6635
  • [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
  • [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
  • [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
  • [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
  • [None][ci] move some tests of b200 to post merge by @QiJune in #7093
  • [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
  • [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
  • [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
  • [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
  • [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
  • [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
  • [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
  • [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
  • [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
  • [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
  • [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
  • [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
  • [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
  • [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
  • [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
  • [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
  • [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
  • [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
  • [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
  • [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
  • [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
  • [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
  • [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
  • [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
  • [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
  • [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
  • [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
  • [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
  • [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
  • [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
  • [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
  • [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
  • [None][feat] Deepseek: Start Eag...

v1.1.0rc1 (Pre-release)

22 Aug 10:02 · 7334f93

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831) (see the sketch after this list)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)
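
The breaking sampler_type change above can be pictured as an explicit constructor argument on the LLM API. This is only a sketch: the argument name comes from the PR title, but the literal value shown is an assumption for illustration and should be checked against the released API reference; the model path is a placeholder.

```python
from tensorrt_llm import LLM

# sampler_type is the new argument introduced by #6831; by default the sampler
# is detected from the other options. The value below is assumed for
# illustration only -- consult the LLM API reference for the real choices.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    sampler_type="auto",                       # assumed value: let TRT-LLM pick
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```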

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remo...

v1.1.0rc0 (Pre-release)

16 Aug 00:09 · 26f413a

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support running heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support LoRA reload CPU cache evicted adapter (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove outdated features marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...

v1.0.0rc6 (Pre-release)

07 Aug 10:54 · a16ba64

Announcement Highlights:

  • Model Support

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support of scheduling attention dp request (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • best_of/n for the PyTorch workflow (#5997) (see the sketch after this list)
    • Add speculative metrics for trt llm bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • Check input tokens + improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connection with ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
  • API

  • Benchmark

    • ADP schedule balance optimization (#6061)
    • AllReduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • Update known issues (#6247)
    • trtllm-serve doc improvement (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, install the cuda-python==12.9.1 package (pip install "cuda-python==12.9.1") after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
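
The best_of/n addition for the PyTorch workflow listed above maps to the corresponding SamplingParams fields. A minimal sketch, assuming SamplingParams carries n and best_of as the PR title suggests and that the usual best_of/n semantics apply; the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder model

# Sample four candidate completions and return the two highest-scoring ones
# (field names taken from the PR title; exact semantics assumed, not verified).
params = SamplingParams(max_tokens=48, temperature=0.8, n=2, best_of=4)

result = llm.generate(["Write a haiku about GPUs."], params)[0]
for candidate in result.outputs:
    print(candidate.text)
```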

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • use cudaSetDevice to create context ,fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...

v1.0.0rc5 (Pre-release)

04 Aug 09:45 · fbee279

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32 (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444) (see the sketch after this list)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • Remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable.
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release.
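
KV cache reuse (extended to multimodal models in #5444 above) is driven by the KV cache configuration on the LLM API. A minimal sketch, assuming KvCacheConfig is importable from tensorrt_llm.llmapi with enable_block_reuse and free_gpu_memory_fraction fields; the multimodal model name is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Turn on block reuse so shared prompt (and, per this release, multimodal)
# prefixes can be served from cached KV blocks across requests.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.8,
)

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder multimodal checkpoint
    kv_cache_config=kv_cache_config,
)
params = SamplingParams(max_tokens=32)
print(llm.generate(["Summarize what KV cache reuse does."], params)[0].outputs[0].text)
```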

What's Changed
