
Conversation

wili-65535
Collaborator

Description

Follow-up to the previous PR #3366 about the DTM document, cherry-picked from PR #3655.

This PR adds a more detailed document about using Draft-Target-Model (DTM) in Triton Inference Server.

Goals of this PR:

  • Add steps for using symmetric or asymmetric TP sizes for the draft and target engines (a build sketch follows this list).
  • Add more detailed steps for using Fast-Logits-D2D (a Triton config sketch follows the build sketch).
  • Add steps for combining the two features above.
  • Note that streaming mode and batched-request mode are not supported as of the current version.
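
For illustration, a minimal sketch of the asymmetric-TP build flow, assuming hypothetical LLaMA checkpoints and TP sizes (1 for the draft, 4 for the target); the flags are the standard trtllm-build options for external-draft-token speculative decoding, but follow the merged document for the authoritative steps:

```bash
# Sketch: TP=1 draft engine and TP=4 target engine (paths and sizes are placeholders).

# Draft model (small), TP=1
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-draft-hf --output_dir ./ckpt-draft-tp1 \
    --dtype float16 --tp_size 1
trtllm-build --checkpoint_dir ./ckpt-draft-tp1 --output_dir ./engine-draft \
    --gather_generation_logits   # expose generation logits for the fast-logits path

# Target model (large), TP=4
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-target-hf --output_dir ./ckpt-target-tp4 \
    --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir ./ckpt-target-tp4 --output_dir ./engine-target \
    --speculative_decoding_mode draft_tokens_external \
    --max_draft_len 10 \
    --gather_generation_logits
```

Because TP size is fixed at checkpoint-conversion time, the two engines can use different TP sizes independently; the Triton-side wiring only has to point at each engine's own directory.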
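
And a minimal sketch of the Triton-side Fast-Logits-D2D wiring, assuming the model-repository layout from the TensorRT-LLM backend templates (a target model `tensorrt_llm`, a draft copy `tensorrt_llm_draft`, and `tensorrt_llm_bls` orchestrating both); the parameter key and field names below are assumptions based on those templates, so treat the merged document as authoritative:

```bash
# Enable device-to-device logits transfer on the engine model
# (assumed parameter key; set it in tensorrt_llm_draft/config.pbtxt as well).
cat >> triton_model_repo/tensorrt_llm/config.pbtxt <<'EOF'
parameters: {
  key: "speculative_decoding_fast_logits"
  value: { string_value: "1" }
}
EOF

# In tensorrt_llm_bls/config.pbtxt, point the BLS at the two models:
#   tensorrt_llm_model_name       -> "tensorrt_llm"
#   tensorrt_llm_draft_model_name -> "tensorrt_llm_draft"
# Then send requests with num_draft_tokens > 0 and use_draft_logits = true.
```

Combining the two features amounts to building the engines with different TP sizes as sketched above and then enabling the fast-logits path on both models; the document also covers launching the server so that draft and target ranks can exchange logits directly.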

@wili-65535 wili-65535 changed the base branch from main to release/0.19 April 23, 2025 07:27
@wili-65535 wili-65535 requested a review from a team as a code owner April 23, 2025 07:27
@wili-65535 wili-65535 self-assigned this Apr 23, 2025
@wili-65535 wili-65535 added the Doc label Apr 23, 2025
@wili-65535 wili-65535 requested a review from litaotju April 23, 2025 07:33
@wili-65535 wili-65535 force-pushed the docs/dtm-release-0.19 branch from 8ecae79 to 25740c4 on April 23, 2025 07:39
@wili-65535 wili-65535 requested a review from achartier April 23, 2025 07:39
@wili-65535 wili-65535 force-pushed the docs/dtm-release-0.19 branch from 05647f2 to 63b42c4 on April 24, 2025 01:09
@litaotju
Collaborator

Approving the change; this is fine to submit to 0.19.

@wili-65535 please let me know when to merge it after you have addressed all the comments from @achartier. Thanks!

@wili-65535 wili-65535 force-pushed the docs/dtm-release-0.19 branch from 63b42c4 to a085fe2 on April 24, 2025 03:03
Collaborator

@achartier achartier left a comment

LGTM

@achartier
Collaborator

/bot skip --comment "Doc update"

@achartier achartier enabled auto-merge (squash) April 24, 2025 03:26
@tensorrt-cicd
Collaborator

PR_Github #3244 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3244 [ skip ] completed with state SUCCESS
Skipping testing for commit a085fe2

@achartier achartier merged commit 33c4d49 into NVIDIA:release/0.19 Apr 24, 2025
3 checks passed
@wili-65535
Collaborator Author

Thanks a lot @achartier, @litaotju!

@wili-65535 wili-65535 deleted the docs/dtm-release-0.19 branch April 28, 2025 06:27
DomBrown pushed a commit to DomBrown/TensorRT-LLM that referenced this pull request Apr 28, 2025
Signed-off-by: ZhanruiSunCh <[email protected]>

test: add test cases for 0.19 release (NVIDIA#3608)

* fix test name

Signed-off-by: Ivy Zhang <[email protected]>

* add quickstart test for nemotron-ultra

Signed-off-by: Ivy Zhang <[email protected]>

* add rcca multi-node test case for deepseek-v3

Signed-off-by: Ivy Zhang <[email protected]>

* add rcca info

Signed-off-by: Ivy Zhang <[email protected]>

---------

Signed-off-by: Ivy Zhang <[email protected]>
Signed-off-by: Ivy Zhang <[email protected]>

squash (NVIDIA#3642)

Signed-off-by: Enwei Zhu <[email protected]>

fix: nvbugs/5187237: fix deterministic mode crash (NVIDIA#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error

* remove waive
Signed-off-by: Xiwen Yu <[email protected]>

* Revert "remove waive"

This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.

Signed-off-by: Xiwen Yu <[email protected]>

* revert ar fusion

Signed-off-by: Xiwen Yu <[email protected]>

---------

Signed-off-by: Xiwen Yu <[email protected]>

update fp8 doc (NVIDIA#3647)

Signed-off-by: taoli <[email protected]>
Co-authored-by: taoli <[email protected]>

tests: change qa perf test to trtllm-bench (NVIDIA#3619)

Signed-off-by: Ruodi <[email protected]>
Co-authored-by: Larry <[email protected]>

 fix: FP8 quantized lm_head (NvBug 5214229) (NVIDIA#3567)

Signed-off-by: Enwei Zhu <[email protected]>

infra: Add PR approval protection for the release branch (NVIDIA#3634)

Signed-off-by: Yanchao Lu <[email protected]>

fix: nvbugs/5231298: pytorch allreduce issue (NVIDIA#3673)

Signed-off-by: Xiwen Yu <[email protected]>

Fix: nvbugs/5222698 variable not defined (NVIDIA#3630)

* Fix: nvbugs/5222698 variable not defined

Signed-off-by: Zongfei Jing <[email protected]>

* Tidy code

Signed-off-by: Zongfei Jing <[email protected]>

---------

Signed-off-by: Zongfei Jing <[email protected]>

test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (NVIDIA#3685)

Signed-off-by: nv-guomingz <[email protected]>

test:restore fp8 kv cache testing for L0 (NVIDIA#3671)

Signed-off-by: nv-guomingz <[email protected]>

doc: Update DeepSeek perf docs (NVIDIA#3693)

* Update DeepSeek perf docs

Signed-off-by: Kaiyu Xie <[email protected]>

* update

Signed-off-by: Kaiyu Xie <[email protected]>

* Apply suggestions from code review

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>

---------

Signed-off-by: Kaiyu Xie <[email protected]>
Co-authored-by: Copilot <[email protected]>

tests: waive test_llm_multi_node (NVIDIA#3664)

Signed-off-by: junq <[email protected]>

fix: update test_user_buffers_mm_add_prologue atol (NVIDIA#3711)

Signed-off-by: Jin Li <[email protected]>

Fix: cherry-pick hmac encryption from main branch (NVIDIA#3635)

* security fix cherry-pick changes from main

Signed-off-by: Yibin Li <[email protected]>

* fix hmac in remote mpi session (NVIDIA#3649)

Signed-off-by: Yan Chunwei <[email protected]>

---------

Signed-off-by: Yibin Li <[email protected]>
Signed-off-by: Yan Chunwei <[email protected]>
Co-authored-by: Yan Chunwei <[email protected]>

Un-waive DS-V3-Lite tests. (NVIDIA#3621)

Signed-off-by: Tracin <[email protected]>

fix: FP8 kv accuracy (NVIDIA#3675)

* fix FP8 kv accuracy

Signed-off-by: Dylan Chen <[email protected]>

* update doc

Signed-off-by: Dylan Chen <[email protected]>

---------

Signed-off-by: Dylan Chen <[email protected]>

Fix script options for engines. (NVIDIA#3622)

Signed-off-by: Tracin <[email protected]>

unwaive multi-node test (NVIDIA#3721)

Signed-off-by: Superjomn <[email protected]>

chore : Split more tests out of gpt tests (NVIDIA#3524) (NVIDIA#3674)

Signed-off-by: peaceh <[email protected]>

doc:add torch examples link into torch backend documentation (NVIDIA#3749)

Signed-off-by: nv-guomingz <[email protected]>
Co-authored-by: nv-guomingz <[email protected]>

test: Get Eagle tests working (NVIDIA#3593) (NVIDIA#3722)

Signed-off-by: Balaram Buddharaju <[email protected]>
Co-authored-by: brb-nv <[email protected]>

Waive L0 test (NVIDIA#3756)

Signed-off-by: Yiqing Yan <[email protected]>

waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (NVIDIA#3656)

Signed-off-by: Ruodi <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>

Update ds v3 parameters in stress test. (NVIDIA#3676)

waive gemma on L20 (NVIDIA#3766)

Signed-off-by: Ivy Zhang <[email protected]>

https://nvbugs/5141291: Fix convert.py script for Qwen model. (NVIDIA#3758)

Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.

Signed-off-by: Yukun He <[email protected]>

fix: PP4 fixes and cleanup (NVIDIA#3688)

Signed-off-by: Anurag Mukkara <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>

remove benchmark test list (NVIDIA#3643)

Signed-off-by: Ivy Zhang <[email protected]>

skip disagg deepseek test if sm!=90 (NVIDIA#3720)

Signed-off-by: Chuang Zhu <[email protected]>

test: skip failed cases on B200 (NVIDIA#3710)

* add skip condition to tests

Signed-off-by: xinhe-nv <[email protected]>

* fix error

Signed-off-by: xinhe-nv <[email protected]>

---------

Signed-off-by: xinhe-nv <[email protected]>

test: [nvbug: 5234494] skip_pre_ada for fp8 cases (NVIDIA#3718)

* skip_pre_ada for fp8 cases

Signed-off-by: Ivy Zhang <[email protected]>

* update

Signed-off-by: Ivy Zhang <[email protected]>

* update after rebase

Signed-off-by: Ivy Zhang <[email protected]>

---------

Signed-off-by: Ivy Zhang <[email protected]>

add know issue to deepseek doc. (NVIDIA#3800)

Signed-off-by: Fanrong Li <[email protected]>

Fix ModelOpt Mixtral AWQ OOM (NVIDIA#3714) (NVIDIA#3761)

Signed-off-by: Barry Kang <[email protected]>
Co-authored-by: Larry <[email protected]>

Waive L0 tests (NVIDIA#3826)

Signed-off-by: Yiqing Yan <[email protected]>

fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (NVIDIA#3793)

* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.

Signed-off-by: Yukun He <[email protected]>

* Fix fused_moe fallback issue. (NVIDIA#3652)

min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.

Signed-off-by: Yukun He <[email protected]>

---------

Signed-off-by: Yukun He <[email protected]>

[doc] Better document for Draft-Target-Model (DTM) speculative decoding (NVIDIA#3797)

Signed-off-by: wili-65535 <[email protected]>
Signed-off-by: Dom Brown <[email protected]>

Fix pre-commit

Signed-off-by: Dom Brown <[email protected]>

Fix again

Signed-off-by: Dom Brown <[email protected]>

Address some review comments for the MI

Signed-off-by: Dom Brown <[email protected]>
dcampora pushed a commit that referenced this pull request Apr 29, 2025