Conversation

@mikeiovine (Collaborator) commented Jul 16, 2025

Description

This PR adds chunked prefill support to the 2-model spec decode flow. In this design, prefill chunks are sent to the draft model immediately after they are processed by the target.

One consequence of this design is that the draft model must also be loaded on prefill workers in disaggregated (disagg) serving scenarios.
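
At a high level, the flow looks like the following sketch (hypothetical names only: target, draft, and prefill stand in for the real engine APIs):

def prefill_with_draft(target, draft, chunks):
    # chunks: the prompt split according to the max_num_tokens budget.
    for chunk in chunks:
        target.prefill(chunk)  # the target processes each chunk first...
        draft.prefill(chunk)   # ...then the same chunk is sent to the draft
    # Draft-token proposal only starts once the entire prompt is prefilled.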

Test Coverage

Added new unit tests covering both the one-model and two-model flows. Manually verified that the acceptance rate (AR) is unchanged on a set of long prompts after enabling chunked prefill.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
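
For example, a typical invocation that runs a single test stage with fail-fast disabled:

/bot run --stage-list "A10-1" --disable-fail-fast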

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping validation without due care can break the top of tree.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a stale pipeline without due care can break the top of tree.

Summary by CodeRabbit

  • New Features
    • Improved handling of chunked prefill processing for draft models, allowing more efficient and synchronized processing of input chunks.
  • Enhancements
    • Added tracking of the last processed context chunk for each request, ensuring better management of draft requests and token generation during chunked prefill scenarios.
  • Tests
    • Extended tests to cover chunked prefill scenarios with varied prompt inputs and token limits.

@coderabbitai bot (Contributor) commented Jul 16, 2025

Walkthrough

The changes introduce a new attribute, py_last_context_chunk, to track context chunk boundaries in the LlmRequest class and update it during request state transitions. The speculative draft model logic is extended to handle chunked prefill scenarios, ensuring correct draft request management and synchronization between target and draft models. The test suite is also enhanced to cover chunked prefill cases.

Changes

Changed files and summaries:

  • tensorrt_llm/_torch/pyexecutor/llm_request.py — Added a py_last_context_chunk attribute (initialized as (None, None)) to the LlmRequest class constructor.
  • tensorrt_llm/_torch/pyexecutor/py_executor.py — Updated _update_request_states_tp to set py_last_context_chunk for each context request before advancing the chunk.
  • tensorrt_llm/_torch/speculative/model_drafter.py — Extended the draft model logic to handle chunked prefill, using py_last_context_chunk for chunk tracking and sync; renamed _create_chunked_context_request to _create_accepted_tokens_request.
  • tests/unittest/_torch/speculative/test_eagle3.py — Extended test_llama_eagle3 to parameterize and test chunked prefill scenarios with adjusted prompts and configs.
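
A minimal self-contained sketch of this bookkeeping (the tuple semantics — (context_current_position, context_chunk_size) — and the move_to_next_context_chunk call are inferred from the summaries above, not copied from the source):

class LlmRequestSketch:
    def __init__(self, chunk_size: int):
        self.context_chunk_size = chunk_size
        self.context_current_position = 0
        self.py_last_context_chunk = (None, None)  # nothing processed yet

    def move_to_next_context_chunk(self):
        self.context_current_position += self.context_chunk_size

def update_request_states(context_requests):
    # Mirrors the described _update_request_states_tp change: record the chunk
    # boundaries *before* advancing, so the drafter can replay that chunk.
    for req in context_requests:
        req.py_last_context_chunk = (req.context_current_position,
                                     req.context_chunk_size)
        req.move_to_next_context_chunk()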

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PyExecutor
    participant LlmRequest
    participant ModelDrafter

    User->>PyExecutor: Submit request
    PyExecutor->>LlmRequest: Initialize (py_last_context_chunk = (None, None))
    loop For each context chunk
        PyExecutor->>LlmRequest: Update py_last_context_chunk (start, end)
        PyExecutor->>ModelDrafter: Prepare draft batch (with chunk info)
        ModelDrafter->>LlmRequest: Create/Update context request with chunk info
        ModelDrafter->>ModelDrafter: Process decoded tokens (synchronize with target)
    end

Estimated code review effort

2 (~20 minutes)

Suggested labels

Community want to contribute

Suggested reviewers

  • HuiGao-NV
  • yilin-void
  • qiaoxj07

Poem

In the warren of code, a chunk hops anew,
Tracking its journey, from start point to through.
Drafts now aligned, in perfect prefill,
Synchrony hopping, with rabbit-like skill.
Each chunk accounted, no tokens astray—
The LLM’s request hops smarter today! 🐇


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2381142 and 875fba9.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@mikeiovine changed the title from "[feat] Support chunked prefill on spec decode 2 model" to "[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model" on Jul 16, 2025
@mikeiovine force-pushed the chunked-prefill-spec-dec branch 5 times, most recently from 840fb61 to 95ab84c, on July 17, 2025 16:48
@mikeiovine requested a review from ziyixiong-nv on July 17, 2025 16:50
@mikeiovine marked this pull request as ready for review on July 17, 2025 16:51
@mikeiovine requested review from a team as code owners on July 17, 2025 16:51
@mikeiovine (Collaborator, Author) commented:

/bot run

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)

78-90: Address line length violation and improve prompt readability.

The long prompt string on line 81 exceeds the 120 character limit flagged by static analysis.

-        prompts = [
-            "The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\nThe city is divided into 20 arrondissements, each with its own unique character and charm. The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition to its cultural and historical attractions, Paris is also a great destination for foodies. The city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and "
-        ]
+        prompts = [
+            ("The capital of France is a city of romance, art, fashion, and cuisine. "
+             "Paris is a must-visit destination for anyone who loves history, architecture, and culture. "
+             "From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer "
+             "for every interest and age.\nThe city is divided into 20 arrondissements, each with its own "
+             "unique character and charm. The Latin Quarter is a popular area for students and young travelers, "
+             "while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous "
+             "for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful "
+             "parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich "
+             "history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also "
+             "explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition "
+             "to its cultural and historical attractions, Paris is also a great destination for foodies. The "
+             "city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample "
+             "the city's famous dishes at one of the many restaurants, cafes, and ")
+        ]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 840fb61 and 95ab84c.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (5 hunks)
  • tests/unittest/_torch/speculative/test_eagle3.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm.py (2)
  • tokenizer (657-661)
  • tokenizer (664-665)
tests/unittest/llmapi/test_llm.py (1)
  • encode (308-309)
🪛 Ruff (0.12.2)
tests/unittest/_torch/speculative/test_eagle3.py

81-81: Line too long (1197 > 120)

(E501)

🔇 Additional comments (7)
tests/unittest/_torch/speculative/test_eagle3.py (2)

17-31: Test coverage for chunked prefill looks comprehensive.

The parametrize decorator appropriately adds the enable_chunked_prefill parameter, with test cases covering both chunked and non-chunked scenarios across different configurations.


62-66: Configuration for chunked prefill is correctly implemented.

The conditional logic properly enables chunked prefill and reduces max_num_tokens to 64 to trigger the chunked prefill code path, which aligns with the test objectives.
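
In the llmapi, that configuration looks roughly like this (a sketch, not the verbatim test; model_dir and spec_config are placeholders, and the kwarg names are assumed from the llmapi surface referenced above):

from tensorrt_llm import LLM

llm = LLM(
    model=model_dir,
    speculative_config=spec_config,
    enable_chunked_prefill=True,
    # Small token budget so the long test prompt is split across several chunks.
    max_num_tokens=64,
)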

tensorrt_llm/_torch/speculative/model_drafter.py (5)

79-88: Context request creation properly handles chunked prefill boundaries.

The method correctly extracts chunk boundaries from py_last_context_chunk and sets context_current_position and context_chunk_size appropriately for chunked prefill scenarios.
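
Sketched in isolation, the boundary handling reads roughly like this (an approximation of the described method, not the verbatim source; create_draft_copy is a stand-in for however the draft-side request is actually built):

def create_context_request(request, input_tokens):
    new_request = create_draft_copy(request, input_tokens)
    begin, size = request.py_last_context_chunk
    if begin is not None:  # (None, None) means the request was never chunked
        new_request.context_current_position = begin
        new_request.context_chunk_size = size
    return new_request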


103-118: Method rename improves clarity and maintains correct logic.

The rename from _create_chunked_context_request to _create_accepted_tokens_request better describes the method's purpose. The logic for handling accepted tokens in chunked context remains correct.


180-194: Chunked prefill handling in draft batch preparation is well-implemented.

The logic correctly:

  • Skips requests with context_current_position == 0 (still need target model processing)
  • Handles chunked prefill by reconstructing input tokens and creating context requests
  • Properly integrates with the existing draft batch workflow
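
In sketch form (get_tokens(0) is assumed to return the request's beam-0 token list; the real _prepare_draft_batch carries more state than this):

def collect_draft_context_requests(context_requests):
    draft_batch = []
    for request in context_requests:
        if request.context_current_position == 0:
            # The target has not prefilled anything yet; nothing to replay.
            continue
        # Replay only the prompt prefix the target has already processed.
        input_tokens = request.get_tokens(0)[:request.context_current_position]
        draft_batch.append(create_context_request(request, input_tokens))
    return draft_batch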

285-289: Token processing correctly defers draft token addition for chunked prefill.

The logic appropriately checks if the target model request is not in GENERATION_IN_PROGRESS state and defers adding draft tokens until the entire prompt is processed, while properly freeing resources.
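
Roughly, the deferral check looks like this (state and attribute names follow the review text; free_draft_resources and py_draft_tokens are assumed names for the release step and the token attachment point):

def attach_or_defer(target_request, draft_request, draft_tokens):
    if target_request.state != LlmRequestState.GENERATION_IN_PROGRESS:
        # Still mid-prefill: discard the drafted tokens for now and free the
        # draft request's resources; drafting resumes after the final chunk.
        free_draft_resources(draft_request)
    else:
        target_request.py_draft_tokens = draft_tokens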


142-143: Method call update aligns with the renamed method.

The call to _create_accepted_tokens_request correctly reflects the method rename and maintains the same parameters.

@tensorrt-cicd (Collaborator) commented:

PR_Github #12222 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12222 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9076 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12227 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12227 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9081 completed with status: 'FAILURE'

@mikeiovine requested a review from lfr-0531 on July 18, 2025 15:43
@mikeiovine force-pushed the chunked-prefill-spec-dec branch from a8c5d0b to 92b4a83 on July 18, 2025 16:17
@mikeiovine (Collaborator, Author) commented:

/bot run

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)

78-90: Fix the line length violation for better readability.

The prompt selection logic is well-implemented and appropriate for testing chunked prefill functionality. However, the long prompt string on line 81 exceeds the 120-character limit.

Consider breaking the long prompt into multiple lines for better readability:

-            "The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\nThe city is divided into 20 arrondissements, each with its own unique character and charm. The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition to its cultural and historical attractions, Paris is also a great destination for foodies. The city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and "
+            ("The capital of France is a city of romance, art, fashion, and cuisine. "
+             "Paris is a must-visit destination for anyone who loves history, architecture, and culture. "
+             "From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\n"
+             "The city is divided into 20 arrondissements, each with its own unique character and charm. "
+             "The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. "
+             "The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\n"
+             "Paris is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. "
+             "The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. "
+             "Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\n"
+             "In addition to its cultural and historical attractions, Paris is also a great destination for foodies. "
+             "The city is famous for its cuisine, including croissants, baguettes, and cheese. "
+             "Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and ")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8c5d0b and 92b4a83.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (5 hunks)
  • tests/unittest/_torch/speculative/test_eagle3.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tests/unittest/_torch/speculative/test_eagle3.py

81-81: Line too long (1197 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/_torch/speculative/test_eagle3.py (3)

16-27: LGTM! Comprehensive test coverage for chunked prefill feature.

The parameterization correctly adds the new enable_chunked_prefill parameter and includes test cases for both single-model and two-model scenarios with chunked prefill enabled. The existing test cases are preserved to maintain backward compatibility.


31-31: Function signature properly updated.

The function signature correctly includes the new enable_chunked_prefill parameter with proper type annotation.


62-66: Well-implemented chunked prefill configuration.

The configuration correctly enables chunked prefill and sets max_num_tokens to 64 to ensure the chunked prefill code path is exercised during testing. The comment provides clear context for this choice.

@tensorrt-cicd (Collaborator) commented:

PR_Github #12330 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12330 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9160 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12345 [ run ] triggered by Bot

@mikeiovine (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12719 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12719 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9466 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12740 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12740 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9485 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #12866 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12866 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9589 completed with status: 'FAILURE'

@mikeiovine (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12888 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12888 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9608 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@mikeiovine merged commit 0f2f11f into NVIDIA:main on Jul 25, 2025 (3 checks passed)
@mikeiovine deleted the chunked-prefill-spec-dec branch on July 25, 2025 01:50
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025