
Releases: NVIDIA/TensorRT-LLM

v1.1.0rc5 (Pre-release)

18 Sep 01:49 · 0c9430e

Announcement Highlights:

  • Model Support
    • Enable NvFP4/FP8 quantization for Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349)
    • Support gpt-oss with fp8 kv cache (#7612)
    • Support kvcache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector api (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for deepseek-r1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Top-k logprobs for the TRT backend and top-1 logprob for the PyTorch backend (#6097) (see the sketch after this list)
    • Support chunked prefill for multimodal models (#6843)
    • Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add deepseek r1-w4afp8 quickstart (#7645)
    • Nanobind: Allow none types for fields in result (#7672)
    • Use arrival time from the LLM API when creating LlmRequest in the PyTorch workflow (#7553)
    • Support IPv6 for UCX/ZMQ IP addresses (#7530)
    • Refactor: Quantization Transforms with Inheritance (#7227)
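
As a quick illustration of the top-k logprobs item above, here is a minimal LLM API sketch. It assumes SamplingParams exposes a logprobs field and that each completion carries a logprobs attribute, consistent with the PR title but not verified against this exact build; the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any supported model works here (assumption).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Per the note above, the TRT backend returns top-k logprobs while the
# PyTorch backend returns only the top-1 logprob, so request a single
# logprob to stay portable across both backends (assumed field name).
params = SamplingParams(max_tokens=32, temperature=0.0, logprobs=1)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token logprob entries, if populated
```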

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix]UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependancies by @v-shobhit in #7383
  • [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

Full Changelog: v1.1.0rc4...v1.1.0rc5

v1.1.0rc4 (Pre-release)

10 Sep 07:32 · 62b564a

Announcement Highlights:

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionalities for Responses API (#7341)
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492)
  • Feature
    • Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948) (see the sketch after this list)
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • Add gpt-oss 20G tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • Fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • Add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation (#6696)
    • Update kvcache part (#7549)
    • Rename TensorRT-LLM to TensorRT LLM (#7554)
    • Refine docs for accuracy evaluation of gpt-oss models (#7252)
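
The guided-decoding item above (now usable together with the one-model speculative engine) can be exercised through the LLM API roughly as follows. This is a sketch under the assumption that GuidedDecodingParams is importable from tensorrt_llm.llmapi, accepts a JSON schema via a json field, and that guided_decoding_backend="xgrammar" is a valid LLM argument; the model name and schema are placeholders.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams  # assumed import path

# Assumed: "xgrammar" is an accepted guided-decoding backend name here.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", guided_decoding_backend="xgrammar")

# Constrain the output to a small JSON schema (placeholder schema).
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),  # assumed argument/field names
)

print(llm.generate(["Describe Paris as JSON."], params)[0].outputs[0].text)
```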

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
  • …

v1.1.0rc2.post2 (Pre-release)

15 Sep 05:11 · ef0d06d

Announcement Highlights:

  • Feature
    • Add MNNVL AlltoAll tests to pre-merge (#7465)
    • Support multi-threaded tokenizers for trtllm-serve (#7515)
    • FP8 Context MLA integration (#7581)
    • Support block wise FP8 in wide ep (#7423)
    • Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
    • Make low_precision_combine an LLM arg (#7598)
  • Documentation
    • Update deployment guide and cherry-pick CI test fix from main (#7623)

What's Changed

  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
  • [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
  • [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
  • [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
  • [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
  • [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
  • [None][chore] Make use_low_precision_moe_combine as a llm arg by @zongfeijing in #7598
  • [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
  • [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
  • [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524

Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2

v1.1.0rc2.post1 (Pre-release)

06 Sep 00:06 · 9d6e87a

Announcement Highlights:

  • API
    • Update TargetInfo to accommodate CP in disagg (#7224)
  • Benchmark
    • Minor fixes to slurm and benchmark scripts (#7453)
  • Feature
    • Support DeepGEMM swap-AB on sm100 (#7355)
    • Merge add sparse exp and shared exp into local reduction (#7422)
    • Add batch waiting when scheduling (#7287)
    • Reuse pytorch memory segments occupied by cudagraph pool (#7457)
    • Complete the last missing allreduce op in Llama3/4 (#7420)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Store blog 10 media via LFS (#7375)

What's Changed

  • [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
  • [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
  • [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
  • [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
  • [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
  • [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
  • [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
  • [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
  • [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
  • [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
  • [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
  • [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
  • [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
  • [None][fix] Update DG side branch name by @Barry-Delaney in #7491
  • [None][fix] Update DG commit by @Barry-Delaney in #7534
  • [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
  • [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
  • [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7287
  • [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
  • [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487

Full Changelog: v1.1.0rc2...v1.1.0rc2.post1

v1.1.0rc3 (Pre-release)

04 Sep 08:24 · e81c50d

Announcement Highlights:

  • Model Support
    • Add fp8 support for Mistral Small 3.1 (#6731)
  • Benchmark
    • Add benchmark TRT flow test for MIG (#6884)
    • Mistral Small 3.1 accuracy tests (#6909)
  • Feature
    • Update TargetInfo to accommodate CP in disagg (#7224)
    • Merge add sparse exp and shared exp into local reduction (#7369)
    • Support NVFP4 KV Cache (#6244)
    • Allocate MoE workspace only when necessary (release/1.0 retargeted) (#6955)
    • Implement capturable drafting loops for speculation (#7100)
    • Revert phi4-mm aggregate mode (#6907)
    • Complete the last missing allreduce op in Llama3/4. (#6850)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Add docs for Gemma3 VLMs (#6880)
    • Add legacy section for TensorRT engine (#6724)
    • Update DeepSeek example doc (#7358)

Full Changelog: v1.1.0rc2...v1.1.0rc3

v1.1.0rc2 (Pre-release)

31 Aug 02:22 · 15ec2b8

Announcement Highlights:

  • Model Support

    • Refactor llama4 for multimodal encoder IFB (#6844)
  • API

    • Add standalone multimodal encoder (#6743)
    • Enable Cross-Attention to use XQA kernels for Whisper (#7035)
    • Enable nanobind as the default binding library (#6608)
    • trtllm-serve + autodeploy integration (#7141)
    • Chat completions API for gpt-oss (#7261) (client example after this list)
    • KV Cache Connector API (#7228)
    • Create PyExecutor from TorchLlmArgs Part 1 (#7105)
    • TP Sharding read from the model config (#6972)
  • Benchmark

    • Add llama4 tp4 tests (#6989)
    • Add test_multi_nodes_eval tests (#7108)
    • Nsys profile output kernel classifier (#7020)
    • Add kv cache size in bench metric and fix failed cases (#7160)
    • Add perf metrics endpoint to openai server and openai disagg server (#6985)
    • Add gpt-oss tests to sanity list (#7158)
    • Add L20-specific QA test list (#7067)
    • Add beam search CudaGraph + Overlap Scheduler tests (#7326)
    • Update qwen3 timeout to 60 minutes (#7200)
    • Update maxnt of llama_v3.2_1b bench (#7279)
    • Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
    • Accelerate global scale calculations for deepEP fp4 combine (#7126)
    • Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
    • Balance the request based on number of tokens in AttentionDP (#7183)
    • Wrap the swiglu into custom op to avoid redundant device copy (#7021)
  • Feature

    • Add QWQ-32b torch test (#7284)
    • Fix llama4 multimodal by skipping request validation (#6957)
    • Add group attention pattern for solar-pro-preview (#7054)
    • Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
    • Update lora for phi4-mm (#6817)
    • Refactor the CUDA graph runner to manage all CUDA graphs (#6846)
    • Enable chunked prefill for Nemotron-H (#6334)
    • Add customized default routing method (#6818)
    • Testing cache transmission functionality in Python (#7025)
    • Simplify decoder state initialization for speculative decoding (#6869)
    • Support MMMU for multimodal models (#6828)
    • Deepseek: Start Eagle work (#6210)
    • Optimize and refactor alltoall in WideEP (#6973)
    • Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
    • Hopper FP8 context MLA (#7116)
    • Padding for piecewise cudagraph (#6750)
    • Add low precision all2all for mnnvl (#7155)
    • Use NUMA to bind CPU (#7304)
    • Skip prefetching consolidated safetensors when appropriate (#7013)
    • Unify sampler handle logits implementation (#6867)
    • Move fusion, kvcache, and compile to modular inference optimizer (#7057)
    • Make finalize fusion part of the tactic selection logic (#6915)
    • Fuse slicing into MoE (#6728)
    • Add logging for OAI disagg server (#7232)
  • Documentation

    • Update gpt-oss deployment guide to latest release image (#7101)
    • Update stale link for AutoDeploy (#7135)
    • Add GPT-OSS Deployment Guide into official doc site (#7143)
    • Refine GPT-OSS doc (#7180)
    • Update feature_combination_matrix doc (#6691)
    • Update disagg doc about UCX_MAX_RNDV_RAILS (#7205)
    • Display tech blog for nvidia.github.io domain (#7241)
    • Updated blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
    • Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
    • Add ADP balance blog (#7213)
    • Fix doc formula (#7367)
    • Update disagg readme and scripts for pipeline parallelism (#6875)
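
For the gpt-oss chat completions support called out above, a running trtllm-serve endpoint can be queried with the standard OpenAI client. A minimal sketch, assuming a local server was started for a gpt-oss checkpoint on the default port 8000; the model name, port, and prompt are placeholders, not pinned by this changelog.

```python
from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible endpoint; URL and port are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # must match the model the server was launched with
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me one fun fact about GPUs."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```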

What's Changed

  • [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
  • [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
  • [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
  • [None][fix] fix llmapi import error by @crazydemo in #7030
  • [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
  • [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
  • [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
  • [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
  • [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
  • [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
  • [None][chore] Update namelist in blossom-ci by @karljang in #7015
  • [None][ci] move unittests to sub-directories by @Funatiq in #6635
  • [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
  • [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
  • [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
  • [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
  • [None][ci] move some tests of b200 to post merge by @QiJune in #7093
  • [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
  • [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
  • [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
  • [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
  • [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
  • [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
  • [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
  • [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
  • [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
  • [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
  • [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
  • [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
  • [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
  • [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
  • [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
  • [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
  • [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
  • [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
  • [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
  • [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
  • [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
  • [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
  • [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
  • [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
  • [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
  • [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
  • [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
  • [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
  • [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
  • [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
  • [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
  • [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
  • [None][feat] Deepseek: Start Eag...

v1.1.0rc1 (Pre-release)

22 Aug 10:02 · 7334f93

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831) (see the sketch after this list)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)
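
The breaking sampler_type change above can be pictured as an explicit constructor argument on the LLM API. This is only a sketch: the argument name comes from the PR title, but the literal value shown is an assumption for illustration and should be checked against the released API reference; the model path is a placeholder.

```python
from tensorrt_llm import LLM

# sampler_type is the new argument introduced by #6831; by default the sampler
# is detected from the other options. The value below is assumed for
# illustration only -- consult the LLM API reference for the real choices.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    sampler_type="auto",                       # assumed value: let TRT-LLM pick
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```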

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remo...

v1.1.0rc0 (Pre-release)

16 Aug 00:09 · 26f413a

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support running heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support LoRA reload CPU cache evicted adapter (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove outdated features marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...

v1.0.0rc6 (Pre-release)

07 Aug 10:54 · a16ba64

Announcement Highlights:

  • Model Support

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support of scheduling attention dp request (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • best_of/n for the PyTorch workflow (#5997) (see the sketch after this list)
    • Add speculative metrics for trt llm bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • Check input tokens + improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connection with ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
  • API

  • Benchmark

    • ADP schedule balance optimization (#6061)
    • AllReduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • Update known issues (#6247)
    • trtllm-serve doc improvement (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, install the cuda-python==12.9.1 package (pip install "cuda-python==12.9.1") after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
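
The best_of/n addition for the PyTorch workflow listed above maps to the corresponding SamplingParams fields. A minimal sketch, assuming SamplingParams carries n and best_of as the PR title suggests and that the usual best_of/n semantics apply; the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder model

# Sample four candidate completions and return the two highest-scoring ones
# (field names taken from the PR title; exact semantics assumed, not verified).
params = SamplingParams(max_tokens=48, temperature=0.8, n=2, best_of=4)

result = llm.generate(["Write a haiku about GPUs."], params)[0]
for candidate in result.outputs:
    print(candidate.text)
```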

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • use cudaSetDevice to create context ,fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...

v1.0.0rc5 (Pre-release)

04 Aug 09:45 · fbee279

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32 (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444) (see the sketch after this list)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • Remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable.
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release.
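
KV cache reuse (extended to multimodal models in #5444 above) is driven by the KV cache configuration on the LLM API. A minimal sketch, assuming KvCacheConfig is importable from tensorrt_llm.llmapi with enable_block_reuse and free_gpu_memory_fraction fields; the multimodal model name is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Turn on block reuse so shared prompt (and, per this release, multimodal)
# prefixes can be served from cached KV blocks across requests.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.8,
)

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder multimodal checkpoint
    kv_cache_config=kv_cache_config,
)
params = SamplingParams(max_tokens=32)
print(llm.generate(["Summarize what KV cache reuse does."], params)[0].outputs[0].text)
```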

What's Changed
