Releases · NVIDIA/TensorRT-LLM
v1.1.0rc5
Announcement Highlights
- Model Support
- API
- Add TorchLlmArgs to the connector api (#7493)
- Benchmark
- Feature
- Optimize MLA kernels with separate reduction kernels (#7597)
- Wrap MOE with custom op (#7277)
- Make the should_use_spec_decode logic a bit smarter (#7112)
- Use a shell context to install dependencies (#7383)
- Top-k logprobs for TRT backend and top-1 logprob for PyT backend (#6097) (see the sketch after this list)
- Support chunked prefill for multimodal models (#6843)
- Optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477)
- Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
- Add deepseek r1-w4afp8 quickstart (#7645)
- Nanobind: Allow none types for fields in result (#7672)
- Using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
- UCX zmq ip support ipv6 (#7530)
- Refactor: Quantization Transforms with Inheritance (#7227)
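For the top-k logprobs highlight above, a minimal sketch of requesting log probabilities through the LLM API is shown below. The model path, the `logprobs` field semantics, and the result attributes are assumptions for illustration and may differ from what this release actually exposes.

```python
# Sketch only: model path, logprobs semantics, and result attributes are assumptions,
# not verified against v1.1.0rc5.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # any supported HF checkpoint

# Ask for log probabilities alongside the generated tokens.
params = SamplingParams(max_tokens=16, logprobs=3)  # top-3 per generated token (assumed semantics)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    seq = out.outputs[0]
    print(seq.text)
    print(getattr(seq, "logprobs", None))  # per-token logprob info, if exposed by this build
```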
What's Changed
- [None][chore] Remove closed bugs by @xinhe-nv in #7591
- [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
- [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
- [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
- [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
- [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
- [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
- [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
- [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
- [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
- [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
- [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
- [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
- [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
- [None][fix]UCX zmq ip support ipv6 by @chuangz0 in #7530
- [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
- [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
- [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
- [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
- [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
- [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
- [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
- [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
- [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
- [None][feat] Use a shell context to install dependancies by @v-shobhit in #7383
- [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
- [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
- [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
- [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
- [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
- [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
- [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
- [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
- [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
- [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
- [None][ci] Some improvements for Slurm CI by @chzblych in #7689
- [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
- [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
- [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
- [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
- [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
- [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
- [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
- [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
- [None][test] add test for min_tokens by @ixlmar in #7678
- [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
- [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
- [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
- [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709
New Contributors
Full Changelog: v1.1.0rc4...v1.1.0rc5
v1.1.0rc4
Announcement Highlights:
- Model Support
- API
- Benchmark
- Test trtllm-serve with --extra_llm_api_options (#7492)
- Feature
- Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
- Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
- Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948)
- Separate run_shape_prop as another graph utility (#7313)
- MultiLayer Eagle (#7234)
- Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
- Add NVFP4 x FP8 (#6809)
- Support hashing and KV cache reuse for videos (#7360)
- Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
- Introduce QKNormRoPEAttention module (#6830)
- AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
- Support KV cache salting for secure KV cache reuse (#7106)
- trtllm-gen kernels support sm103 (#7570)
- Move stop_criteria to sample_async (#7041)
- KV cache transfer for uneven pp (#7117)
- Update multimodal utility `get_num_tokens_per_image` for better generalization (#7544)
- AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
- Add Request specific exception (#6931)
- Add DeepSeek-v3-0324 e2e torch test (#7413)
- Add 8-GPU test cases for RTX6000 (#7083)
- add gptoss 20g tests (#7361)
- Nixl support for GDS (#5488)
- CMake option to link statically with cublas/curand (#7178)
- Extend VLM factory and add Mistral3 factory (#7583)
- Documentation
- fix example in docstring (#7410)
- Fix formatting error in Gemma3 readme (#7352)
- Add note about trtllm-serve to the devel container (#7483)
- add GPT OSS Eagle3 blog (#7140)
- 1.0 Documentation. (#6696)
- Update kvcache part (#7549)
- Rename TensorRT-LLM to TensorRT LLM. (#7554)
- refine docs for accuracy evaluation of gpt-oss models (#7252)
What's Changed
- [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
- [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
- [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
- [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
- [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
- [None][doc] fix example in docstring by @tomeras91 in #7410
- [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
- [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
- [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
- [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
- [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
- [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
- [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
- [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
- [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
- [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
- [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
- [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
- [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
- [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
- [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
- [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
- [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
- [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
- [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
- [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
- [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
- [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
- [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
- [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
- [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
- [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
- [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
- [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
- [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
- [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
- [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
- [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
- [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
- [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
- [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
- [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
- [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
- [None][test] update nim and full test list by @crazydemo in #7468
- [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
- [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
- [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
- [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
- [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
- [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
- [None][feat] Add Request specific exception by @Shunkangz in #6931
- [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
- [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
- [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
- [None][chore] Remove closed bugs by @xinhe-nv in #7408
- [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
- [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
- [None][infra] update nspect version by @niukuo in #7552
*...
v1.1.0rc2.post2
Announcement Highlights
- Feature
- Add MNNVL AlltoAll tests to pre-merge (#7465)
- Support multi-threaded tokenizers for trtllm-serve (#7515)
- FP8 Context MLA integration (#7581)
- Support block wise FP8 in wide ep (#7423)
- Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
- Make low_precision_combine an LLM arg (#7598)
- Documentation
- Update deployment guide and cherry-pick CI test fix from main (#7623)
What's Changed
- [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
- [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
- [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
- [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
- [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
- [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
- [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
- [None][chore] Make use_low_precision_moe_combine as a llm arg by @zongfeijing in #7598
- [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
- [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
- [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524
New Contributors
Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2
v1.1.0rc2.post1
Announcement Highlights:
- API
- Update TargetInfo to accommodate CP in disagg (#7224)
- Benchmark
- Minor fixes to slurm and benchmark scripts (#7453)
- Feature
- Documentation
What's Changed
- [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
- [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
- [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
- [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
- [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
- [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
- [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
- [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
- [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
- [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
- [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
- [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
- [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
- [None][fix] Update DG side branch name by @Barry-Delaney in #7491
- [None][fix] Update DG commit by @Barry-Delaney in #7534
- [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
- [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
- [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
- [None][opt] Add batch waiting when scheduling by @yunruis in #7287
- [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
- [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487
New Contributors
- @AndyDai-nv made their first contribution in #7137
Full Changelog: v1.1.0rc2...v1.1.0rc2.post1
v1.1.0rc3
Announcement Highlights:
- Model Support
- Add fp8 support for Mistral Small 3.1 (#6731)
- Benchmark
- Feature
- Update TargetInfo to accommodate CP in disagg (#7224)
- Merge add sparse exp and shared exp into local reduction (#7369)
- Support NVFP4 KV Cache (#6244)
- Allocate MoE workspace only when necessary (release/1.0 retargeted) (#6955)
- Implement capturable drafting loops for speculation (#7100)
- Revert phi4-mm aggregate mode (#6907)
- Complete the last missing allreduce op in Llama3/4. (#6850)
- Documentation
What's Changed
- [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
- [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
- [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
- [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
- [None][chore] Bump version to 1.1.0rc3 by @yiqingy0 in #7394
- [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local reduction by @zongfeijing in #7369
- [None][feat] Support NVFP4 KV Cache by @Tom-Zheng in #6244
- [None][ci] Some improvements for Slurm CI setup by @chzblych in #7407
- [None][chore] Mass integration of release/1.0 - 2nd by @dominicshanshan in #7171
- [None][test] Update case that not support passing quantization fp8 for pytorch backend by @nvamyt in #7302
- [None][infra] Disable GB200-PyTorch-1 due to OOM issue by @yuanjingx87 in #7386
- [https://nvbugs/5481087][fix] fix bug of ci when we use mocker by @byshiue in #7332
- [None][infra] Waive failed case on main 0901 by @EmmaQiaoCh in #7447
- [TRTLLM-7353][feat] Implement capturable drafting loops for speculation by @mikeiovine in #7100
- [None] [doc] Update DeepSeek example doc by @jiahanc in #7358
- [None][fix] Fix nanobind failure by @Tom-Zheng in #7425
- [None][chore] Use llm args in create_py_executor by @leslie-fang25 in #7239
New Contributors
- @AndyDai-nv made their first contribution in #7137
Full Changelog: v1.1.0rc2...v1.1.0rc3
v1.1.0rc2
Announcement Highlights:
- Model Support
- Refactor llama4 for multimodal encoder IFB (#6844)
- API
- Add standalone multimodal encoder (#6743)
- Enable Cross-Attention to use XQA kernels for Whisper (#7035)
- Enable nanobind as the default binding library (#6608)
- trtllm-serve + autodeploy integration (#7141)
- Chat completions API for gpt-oss (#7261) (see the sketch after the highlights)
- KV Cache Connector API (#7228)
- Create PyExecutor from TorchLlmArgs Part 1 (#7105)
- TP Sharding read from the model config (#6972)
- Benchmark
- add llama4 tp4 tests (#6989)
- add test_multi_nodes_eval tests (#7108)
- nsys profile output kernel classifier (#7020)
- add kv cache size in bench metric and fix failed cases (#7160)
- add perf metrics endpoint to openai server and openai disagg server (#6985)
- add gpt-osss tests to sanity list (#7158)
- add l20 specific qa test list (#7067)
- Add beam search CudaGraph + Overlap Scheduler tests (#7326)
- Update qwen3 timeout to 60 minutes (#7200)
- Update maxnt of llama_v3.2_1b bench (#7279)
- Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
- Accelerate global scale calculations for deepEP fp4 combine (#7126)
- Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
- Balance the request based on number of tokens in AttentionDP (#7183)
- Wrap the swiglu into custom op to avoid redundant device copy (#7021)
- Feature
- Add QWQ-32b torch test (#7284)
- Fix llama4 multimodal by skipping request validation (#6957)
- Add group attention pattern for solar-pro-preview (#7054)
- Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
- Update lora for phi4-mm (#6817)
- refactor the CUDA graph runner to manage all CUDA graphs (#6846)
- Enable chunked prefill for Nemotron-H (#6334)
- Add customized default routing method (#6818)
- Testing cache transmission functionality in Python (#7025)
- Simplify decoder state initialization for speculative decoding (#6869)
- Support MMMU for multimodal models (#6828)
- Deepseek: Start Eagle work (#6210)
- Optimize and refactor alltoall in WideEP (#6973)
- Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
- Hopper Fp8 context mla (#7116)
- Padding for piecewise cudagraph (#6750)
- Add low precision all2all for mnnvl (#7155)
- Use numa to bind CPU (#7304)
- Skip prefetching consolidated safetensors when appropriate (#7013)
- Unify sampler handle logits implementation (#6867)
- Move fusion, kvcache, and compile to modular inference optimizer (#7057)
- Make finalize fusion part of the tactic selection logic (#6915)
- Fuse slicing into MoE (#6728)
- Add logging for OAI disagg server (#7232)
- Documentation
- Update gpt-oss deployment guide to latest release image (#7101)
- update stale link for AutoDeploy (#7135)
- Add GPT-OSS Deployment Guide into official doc site (#7143)
- Refine GPT-OSS doc (#7180)
- update feature_combination_matrix doc (#6691)
- update disagg doc about UCX_MAX_RNDV_RAILS (#7205)
- Display tech blog for nvidia.github.io domain (#7241)
- Updated blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
- Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
- add adp balance blog (#7213)
- fix doc formula (#7367)
- update disagg readme and scripts for pipeline parallelism (#6875)
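As a rough illustration of the chat completions API for gpt-oss mentioned above, the sketch below uses the standard OpenAI Python client against a locally running trtllm-serve endpoint. The base URL, API key placeholder, and served model name are assumptions; adjust them to however the server was launched in your environment.

```python
# Sketch only: endpoint URL and model name are assumptions, not part of this release's docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # assumed served model name
    messages=[{"role": "user", "content": "Give me one sentence about TensorRT LLM."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```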
What's Changed
- [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
- [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
- [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
- [None][fix] fix llmapi import error by @crazydemo in #7030
- [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
- [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
- [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
- [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
- [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
- [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
- [None][chore] Update namelist in blossom-ci by @karljang in #7015
- [None][ci] move unittests to sub-directories by @Funatiq in #6635
- [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
- [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
- [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
- [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
- [None][ci] move some tests of b200 to post merge by @QiJune in #7093
- [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
- [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
- [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
- [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
- [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
- [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
- [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
- [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
- [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
- [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
- [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
- [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
- [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
- [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
- [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
- [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
- [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
- [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
- [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
- [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
- [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
- [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
- [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
- [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
- [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
- [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
- [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
- [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
- [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
- [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
- [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
- [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
- [None][feat] Deepseek: Start Eag...
v1.1.0rc1
Announcement Highlights:
- Model Support
- API
- Benchmark
- Feature
- Documentation
What's Changed
- [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
- [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
- [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
- [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
- [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
- [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
- [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
- [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
- [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
- [None][fix] Fix perfect router. by @bobboli in #6797
- [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
- [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
- [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
- [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
- [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
- [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
- [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
- [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
- [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
- [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
- [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
- [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
- [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
- [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
- [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
- [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
- [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
- [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
- [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
- [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
- [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
- [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
- [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
- [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
- [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
- [None][doc] Update gpt oss doc by @bobboli in #6954
- [None] [feat] Support accurate device iter time by @kaiyux in #6906
- [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
- [None] [fix] Fix the macro name by @ChristinaZ in #6983
- [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
- [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
- [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
- [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
- [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
- [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
- [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
- [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
- [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
- [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
- [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
- [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
- [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
- [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
- [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
- [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
- [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
- [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
- [None][chore] Remove closed bugs by @xinhe-nv in #6969
- [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
- [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
- [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
- [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
- [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
- [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
- [None][chore] Remo...
v1.1.0rc0
Announcement Highlights:
- Model Support
- Add model gpt-oss (#6645)
- Support Aggregate mode for phi4-mm (#6184)
- Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
- Support running heterogeneous model execution for Nemotron-H (#6866)
- Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
- API
- BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
- Benchmark
- Feature
- Support LoRA reload CPU cache evicted adapter (#6510)
- Add FP8 context MLA support for SM120 (#6059)
- Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
- Include attention dp rank info with KV cache events (#6563)
- Clean up ngram auto mode, add max_concurrency to configs (#6676)
- Add NCCL Symmetric Integration for All Reduce (#4500)
- Remove input_sf swizzle for module WideEPMoE (#6231)
- Enable guided decoding with disagg serving (#6704)
- Make fused_moe_cute_dsl work on blackwell (#6616)
- Move kv cache measure into transfer session (#6633)
- Optimize CUDA graph memory usage for spec decode cases (#6718)
- Core Metrics Implementation (#5785)
- Resolve KV cache divergence issue (#6628)
- AutoDeploy: Optimize prepare_inputs (#6634)
- Enable FP32 mamba ssm cache (#6574)
- Support SharedTensor on MultimodalParams (#6254)
- Improve dataloading for benchmark_dataset by using batch processing (#6548)
- Store the block of context request into kv cache (#6683)
- Add standardized GitHub issue templates and disable blank issues (#6494)
- Improve the performance of online EPLB on Hopper by better overlapping (#6624)
- Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
- CUTLASS MoE FC2+Finalize fusion (#3294)
- Add GPT OSS support for AutoDeploy (#6641)
- Add LayerNorm module (#6625)
- Support custom repo_dir for SLURM script (#6546)
- DeepEP LL combine FP4 (#6822)
- AutoTuner tuning config refactor and valid tactic generalization (#6545)
- Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
- Add support for Hopper MLA chunked prefill (#6655)
- Helix: extend mapping to support different CP types (#6816)
- Documentation
- Remove the outdated features which marked as Experimental (#5995)
- Add LoRA feature usage doc (#6603)
- Add deployment guide section for VDR task (#6669)
- Add doc for multimodal feature support matrix (#6619)
- Move AutoDeploy README.md to torch docs (#6528)
- Add checkpoint refactor docs (#6592)
- Add K2 tool calling examples (#6667)
- Add the workaround doc for H200 OOM (#6853)
- Update moe support matrix for DS R1 (#6883)
- BREAKING CHANGE: Mismatch between docs and actual commands (#6323)
What's Changed
- Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
- [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
- [None][chore] update readme for perf release test by @ruodil in #6664
- [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
- [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
- [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
- [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
- [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
- [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
- [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
- [None][test] correct test-db context for perf yaml file by @ruodil in #6686
- [None] [feat] Add model gpt-oss by @hlu1 in #6645
- [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
- [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
- [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
- [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
- [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
- [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
- [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
- [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
- [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
- [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
- [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
- [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
- [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
- [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
- [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
- [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
- [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
- [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
- [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
- [None][test] fix yml condition error under qa folder by @ruodil in #6734
- [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
- [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
- [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
- [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
- [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
- [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
- [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
- [None][fix]revert kvcache transfer by @chuangz0 in #6709
- [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
- [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
- [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify `examples` mapping by @venkywonka in #6762
- [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
- [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
- [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
- [None][feat] Core Metrics Implementation by @hcyezhang in #5785
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
- [TRTLLM-6637][feat]...
v1.0.0rc6
Announcement Highlights:
- Model Support
- Feature
- Add LoRA support for Gemma3 (#6371)
- Add support of scheduling attention dp request (#6246)
- Multi-block mode for Hopper spec dec XQA kernel (#4416)
- LLM sleep & wakeup Part 1: virtual device memory (#5034)
- best_of/n for pytorch workflow (#5997)
- Add speculative metrics for trt llm bench (#6476)
- (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
- Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
- check input tokens + improve error handling (#5170)
- Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
- Add vLLM KV Pool support for XQA kernel (#6013)
- Switch to internal version of MMProjector in Gemma3 (#6572)
- Enable fp8 SwiGLU to minimize host overhead (#6540)
- Add Qwen3 MoE support to TensorRT backend (#6470)
- ucx establish connection with zmq (#6090)
- Disable add special tokens for Llama3.3 70B (#6482)
- API
- Benchmark
- Documentation
- Make example SLURM scripts more parameterized (#6511)
- blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
- Exposing the latest tech blogs in README.md (#6553)
- update known issues (#6247)
- trtllm-serve doc improvement. (#5220)
- Adding GPT-OSS Deployment Guide documentation (#6637)
- Exposing the GPT OSS model support blog (#6647)
- Add llama4 hybrid guide (#6640)
- Add DeepSeek R1 deployment guide. (#6579)
- Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
- Known Issues
- On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which fails with `ImportError: cannot import name 'cuda' from 'cuda'` (see the check below).
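A quick way to confirm the workaround took effect is the import check below; it simply exercises the import path named in the error message above and assumes nothing beyond that.

```python
# Run after `pip install cuda-python==12.9.1`; under cuda-python 13 this import
# is the one that raises "ImportError: cannot import name 'cuda' from 'cuda'".
from cuda import cuda  # noqa: F401

print("cuda-python import OK")
```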
What's Changed
- [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
- [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
- [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
- refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
- chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
- fix: Fix missing key by @zerollzeng in #6471
- [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
- [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
- [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
- [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
- [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
- [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
- [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
- [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
- [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
- [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
- [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
- [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
- [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
- [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
- [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
- [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
- [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
- [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
- [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
- [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
- use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
- [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
- [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
- [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
- chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
- test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
- [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
- [None][infra] update namelist by @niukuo in #6465
- [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
- [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
- test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
- [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
- [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
- [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
- [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
- [None][fix] remove closed bugs by @xinhe-nv in #6576
- [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
- [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
- [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
- [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
- [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
- [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
- [None][test] update invalid test name by @crazydemo in #6596
- [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
- [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
- [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
- [None][doc] Fix blog4 typo by @syuoni in #6612
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
- [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
- [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
- [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
- [None][chore] Add readme for perf test by @ruodil in #6443
- [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
- [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
- [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
- [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
- [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...
v1.0.0rc5
Announcement Highlights:
- Model Support
- Feature
- Deepseek R1 FP8 Support on Blackwell (#6486)
- Auto-enable ngram with concurrency <= 32. (#6232)
- Support turning on/off spec decoding dynamically (#6363)
- Improve LoRA cache memory control (#6220)
- Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
- Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
- Add support for external multimodal embeddings (#6263)
- Add support for disaggregation with pp with pytorch backend (#6369)
- Add _prepare_and_schedule_batch function in PyExecutor (#6365)
- Add status tags to LLM API reference (#5707)
- Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
- Support JSON Schema in OpenAI-Compatible API (#6321)
- Support chunked prefill on spec decode 2 model (#6104)
- Enhance beam search support with CUDA graph integration (#6217)
- Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
- Add KV cache reuse support for multimodal models (#5444)
- Multistream initial support for torch compile flow (#5847)
- Support nanobind bindings (#6185)
- Support Weight-Only-Quantization in PyTorch Workflow (#5850)
- Support pytorch LoRA adapter eviction (#5616)
- API
- [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
- Bug Fixes
- fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
- fix illeagel memory access in MLA (#6437)
- Fix nemotronNAS loading for TP>1 (#6447)
- Switch placement of image placeholder for mistral 3.1 (#6435)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
- Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
- Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
- If you encounter the `OSError: CUDA_HOME environment variable is not set` error, set the `CUDA_HOME` environment variable (see the sketch below)
- The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release
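For the `CUDA_HOME` item above, a minimal in-process workaround looks like the sketch below; setting the variable in your shell or job script is equivalent, and the toolkit path shown is an assumption about your installation.

```python
# Workaround sketch: the toolkit path is an assumption; point it at your CUDA install.
import os

os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm  # should no longer fail with "CUDA_HOME environment variable is not set"

print(tensorrt_llm.__version__)
```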
What's Changed
- DeepEP LL support variable hidden size and tokens num by @yilin-void in #6141
- [Fix][Chore][Qwen3] fix bug of using fp4 on sm120 by @byshiue in #6065
- fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings by @MartinMarciniszyn in #6189
- [TRTLLM-5826][feat] Support pytorch LoRA adapter eviction by @amitz-nv in #5616
- W4A8 GEMM by @danielafrimi in #6005
- enh: Lift expectation of single image per sample in Gemma3 VLM by @brb-nv in #6195
- test: add phi-4 multimodel and bielik-11b-v2.2 models for perf test by @ruodil in #5826
- fix: Flush stale `PlanParams` with custom attention mask by @brb-nv in #6163
- doc: remove cuda_graph_config: {} from doc since cuda_graph enabled b… by @nv-guomingz in #6150
- [fix] Fix can_use_alltoall in fused_moe_wide_ep.py by @jinyangyuan-nvidia in #6173
- [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #5850
- test: [CI] remove closed bugs by @xinhe-nv in #6201
- feat: nanobind bindings by @Linda-Stadter in #6185
- infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) by @ZhanruiSunCh in #4656
- doc: add Deprecation Policy section by @QiJune in #5784
- [TRTLLM-4279] feat: Multistream initial support for torch compile flow by @liji-nv in #5847
- [Infra] - Waive failed cases on recent post-merge by @EmmaQiaoCh in #6212
- [BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve by @LinPoly in #5717
- test: Enable GB200 torch compile multi gpu tests by @yizhang-nv in #6145
- [fix] Correct the returned value of has_spec_drafter by @ziyixiong-nv in #6178
- [chore] Clean up quickstart_advanced.py by @mikeiovine in #6021
- [Chore] Replace MODEL_CACHE_DIR with LLM_MODELS_ROOT and unwaive triton_server/test_triton.py::test_gpt_ib[gpt-ib] by @SimengLiu-nv in #5859
- [TRTLLM-5059][feat] Add KV cache reuse support for multimodal models by @chang-l in #5444
- feat: Refactor the fetching request logic by @Shunkangz in #5786
- tests: add timeout_manager to tensorrt flow test cases by @crazydemo in #5942
- feat: moe prepare support topk % 4 != 0 by @WeiHaocheng in #5742
- [fix] Fix flaky mistral E2E test by @2ez4bz in #6230
- bug: [https://nvbugs/5368507] Fix test_generate_with_seed. by @bobboli in #6206
- chore: Mass integration of release/0.21 (part 4) by @dc3671 in #6211
- doc: add supported data modality and types on multimodal serve by @yechank-nvidia in #5988
- chore: bump version to 1.0.0rc5 by @yiqingy0 in #6252
- [TRTLLM-6537][infra] extend multi-gpu tests related file list by @reasonsolo in #6139
- test: update test list for RTX6KD by @StanleySun639 in #6213
- fix: bindings unit tests for nanobind by @Linda-Stadter in #6221
- Add register_fake for finegrained_mixed_dtype_gemm torch_op by @danielafrimi in #6255
- [Issue 6193] Fix gemma3vl weight loader by @johncalesp in #6233
- [feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM by @2ez4bz in #6152
- set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only by @yuanjingx87 in #6234
- [nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency by @raayandhar in #6222
- [AutoDeploy] merge feat/ad-2025-07-07 by @lucaslie in #6196
- [nvbugs/5401261][fix] Fix Triton backend disaggregated serving support by @Tabrizian in #6224
- [refactor] Simplification of Speculative decoding configs - Part 2 by @wili-65535 in #5936
- doc: Refactor documents and examples of disaggregated serving and wide ep by @kaiyux in #6054
- Add basic Nemo Ckpt Lora Loading in pytorch flow by @venkywonka in #6019
- [https://nvbugs/5387771] fix deadlocks due to insufficient numSemaphores by @PerkzZheng in #6262
- fix: nvbug_5398806 by @hchings in #6239
- chore: set default device to cpu on Multimodal models by @yechank-nvidia in #5994
- chore: remove duplicate should_stop_processing check by @QiJune in #6242
- hopper-style context MLA by @zhou-yuxin in #5713
- [nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue by @yweng0828 in #6136
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6289
- [TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler by @stnie in #6223
- [Infra] - Skip failed cases by @EmmaQiaoCh in #6299
- [AutoDeploy] disable flaky MoE nvfp4 test by @lucaslie in #6302
- [feat] Update .coderabbit.yaml with review settings and code guidelines by @venkywonka in #6251
- Waive tests by @Tabrizian in https://github.com/NVIDIA/Ten...