Ensure Chat Template Safe Prompt Truncation #3646


Conversation

pramodith
Collaborator

@pramodith pramodith commented Jun 25, 2025

What does this PR do?

The GRPOTrainer supports truncating the prompt based on the max_prompt_length configuration. Currently, prompt truncation happens via:

https://github.com/pramodith/trl/blob/79ec242aefedc108de9edbea62be6d95070fde03/trl/trainer/grpo_trainer.py#L1083-L1087

However, if the prompt is in ChatML format, naive truncation will break the template, leading to poor training results.

This PR addresses this problem by doing the following:

  1. Computes the length of each turn in the conversation, excluding any special tokens or tokens added by apply_chat_template.
  2. Creates a budget of the number of tokens that need to be truncated.
  3. Iterates through each turn and determines whether the turn should be excluded entirely, truncated, or included without truncation.
  4. Retains the last k turns whose contents add up to at most max_prompt_length.

Since this approach only prunes/truncates tokens in the "content" field of a chat template message, we guarantee the preservation of the chat template.

However, the downside of this approach is that max_prompt_length isn't strictly adhered to, since we don't account for any tokens that apply_chat_template would introduce, such as <|im_start|>role, \n, or a default system prompt.

I think that this is a fair trade-off, though. I initially tried another approach that also accounted for the tokens introduced by the chat template, but there were far too many edge cases to consider.
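
To make the mechanism concrete, here is a minimal sketch of the per-turn budgeting idea. This is illustrative only: the function name truncate_conversation, the Qwen checkpoint, and the exact structure are assumptions for the example, not the code in this PR.

from transformers import AutoTokenizer

def truncate_conversation(messages, tokenizer, max_prompt_length):
    # 1. Length of each turn's content, ignoring special/template tokens.
    turn_ids = [tokenizer(m["content"], add_special_tokens=False)["input_ids"] for m in messages]

    # 2. Budget of tokens that must be removed.
    to_remove = max(0, sum(len(ids) for ids in turn_ids) - max_prompt_length)

    # 3. Walk from the oldest turn: drop it, truncate it, or keep it whole.
    truncated = []
    for msg, ids in zip(messages, turn_ids):
        if to_remove >= len(ids):  # turn is cut entirely
            to_remove -= len(ids)
            continue
        if to_remove > 0:  # turn is partially cut, keeping its tail
            msg = {**msg, "content": tokenizer.decode(ids[to_remove:])}
            to_remove = 0
        truncated.append(msg)  # 4. later turns are kept untouched
    return truncated

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a terse math assistant."},
    {"role": "user", "content": "What is 2 + 2? " * 50},
]
short = truncate_conversation(messages, tokenizer, max_prompt_length=64)
# Re-applying the template keeps the <|im_start|>/<|im_end|> markup intact, but the
# final token count can exceed 64 by the template's own overhead (roles, newlines,
# and a default system prompt if the template adds one).
text = tokenizer.apply_chat_template(short, tokenize=False, add_generation_prompt=True)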

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@LeonEricsson @qgallouedec
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pramodith pramodith marked this pull request as draft June 25, 2025 11:15
@LeonEricsson
Collaborator

Do you plan to split the skip_special_tokens=False fix into a separate PR? I think it makes sense to separate them – this way, we can get the fix merged as soon as possible.

@pramodith
Collaborator Author

Cool, I'll have a new PR with just skip_special_tokens=False in the next hour.

@pramodith pramodith changed the title Prompt Decoding after truncation should not skip special tokens Ensure Chat Template Safe Prompt Truncation Jun 27, 2025
@pramodith pramodith marked this pull request as ready for review June 27, 2025 11:17
@pramodith
Collaborator Author

@LeonEricsson this one is ready for review now. I've mentioned some of the tradeoffs this solution entails in the description of the PR.

Collaborator

@LeonEricsson LeonEricsson left a comment

I tested running with the following conversational dataset:

from datasets import Dataset, load_dataset

# extract_hash_answer is defined elsewhere in the script (omitted here)
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

"""
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]  # type: ignore
    data = data.map(
        lambda x: {  # type: ignore
            "prompt": [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": x["question"]}],
            "answer": extract_hash_answer(x["answer"]),
        }
    )  # type: ignore
    return data  # type: ignore

and the prompts_text returned by _get_prompt_inputs had the wrong system prompt. They contained the tokenizer's default system prompt instead.

Comment on lines 266 to 267
"prompt is conversational, we only truncate the message tokens starting from the top of the conversation
"and do not account for any tokens introduced
Collaborator

Suggested change
"prompt is conversational, we only truncate the message tokens starting from the top of the conversation
"and do not account for any tokens introduced
"prompt is conversational, we only truncate the message tokens starting from the top of the conversation "
"and do not account for any tokens introduced "

prompt_inputs = super()._prepare_inputs(prompt_inputs)
prompt_ids, prompt_mask = prompt_inputs["input_ids"], prompt_inputs["attention_mask"]

if self.max_prompt_length is not None and prompt_mask.sum(-1).max() > self.max_prompt_length:
Collaborator

could you motivate the need for prompt_mask.sum(-1).max() > self.max_prompt_length? is it to avoid unnecessary decode if we don't need to truncate?

Collaborator Author

Yes exactly, gets rid of any redundant ops.

Collaborator

@LeonEricsson LeonEricsson Jul 1, 2025

Got it. Given we've already padded, won't prompt_mask.sum(-1).max() always equal prompt_ids.shape[-1] or prompt_mask.shape[-1] (we're doing 'longest' padding - pad to the longest sequence in the batch)?

Collaborator Author

Yeah that's fair. I'll change that to just use the shape.
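
For context, a quick way to convince yourself of the equivalence under "longest" padding (an illustrative snippet, not trainer code; the checkpoint is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short prompt", "a much longer prompt with many more tokens in it"],
    padding="longest",
    return_tensors="pt",
)
prompt_ids, prompt_mask = batch["input_ids"], batch["attention_mask"]

# With "longest" padding the longest sequence has no pad tokens, so the largest
# row-sum of the attention mask is exactly the padded width of the batch.
assert prompt_mask.sum(-1).max() == prompt_ids.shape[-1]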

Comment on lines 1120 to 1126
prompt_inputs = self.processing_class.apply_chat_template(
truncated_messages, return_dict=True, add_generation_prompt=True
)
prompt_inputs = super()._prepare_inputs(prompt_inputs)
prompt_ids = prompt_inputs["input_ids"]
prompt_mask = prompt_inputs["attention_mask"]
prompt_text = self.processing_class.batch_decode(prompt_ids, skip_special_tokens=False)
Collaborator

this should be outside/after the for-loop, right?

Collaborator Author

Yep, good catch.

@pramodith
Collaborator Author

pramodith commented Jul 1, 2025

I tested running with the following conversational dataset:

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

"""
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]  # type: ignore
    data = data.map(
        lambda x: {  # type: ignore
            "prompt": [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": x["question"]}],
            "answer": extract_hash_answer(x["answer"]),
        }
    )  # type: ignore
    return data  # type: ignore

and the prompts_text returned by _get_prompt_inputs had the wrong system prompt. They contained the tokenizer's default system prompt instead.

What was the max_prompt_length configured to? Some chat templates automatically introduce the default system prompt if the system message is missing in the messages for a given batch. This happens when running the apply_chat_template function.

So I'm assuming that the max_prompt_length set for this example eliminated the custom system prompt. Now that I think about it, this means that any custom system prompt will always be pruned first. It might make more sense for us to start pruning from the first user message instead of from the system prompt, and only prune the system prompt once every turn barring the last user message has been pruned.

But I suppose that would make this approach inconsistent with how the non-chat prompt truncation is handled. Let me know what you think the best approach is.
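
For reference, an easy way to see that behaviour (the Qwen checkpoint is just an example; whether a default system prompt is injected depends on each model's template):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

with_system = [
    {"role": "system", "content": "Custom system prompt."},
    {"role": "user", "content": "Hi"},
]
without_system = [{"role": "user", "content": "Hi"}]

# When the system turn is present, the custom prompt is used verbatim.
print(tokenizer.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True))
# When the system turn has been pruned away, this particular template silently
# substitutes its own default system message, matching what was observed above.
print(tokenizer.apply_chat_template(without_system, tokenize=False, add_generation_prompt=True))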

@LeonEricsson
Collaborator

LeonEricsson commented Jul 2, 2025

What was the max_prompt_length configured to? Some chat templates automatically introduce the default system prompt if the system message is missing in the messages for a given batch. This happens when running the apply_chat_template function.

yes good catch, i believe this was indeed the case.

So I'm assuming that the max_prompt_length set for this example eliminated the custom system prompt. Now that I think about it, this means that any custom system prompt will always be pruned first. It might make more sense for us to start pruning from the first user message instead of from the system prompt, and only prune the system prompt once every turn barring the last user message has been pruned.

But I suppose that would make this approach inconsistent with how the non-chat prompt truncation is handled. Let me know what you think the best approach is.

tbh, I'm not entirely sure what the correct approach is here. I feel like user content should indeed be truncated first. However, I agree that we're moving into territory where handling conversational versus non-conversational formats significantly diverges.

@pramodith
Collaborator Author

tbh, I'm not entirely sure what the correct approach is here. I feel like user content should indeed be truncated first. However, I agree that we're moving into territory where handling conversational versus non-conversational formats significantly diverges.

Do we want to rope in someone else for a third opinion? I don't think there's a right approach/answer here; prompt truncation is always going to be less than ideal in terms of how it affects training, and one can make a case for either approach.

@pramodith
Collaborator Author

@qgallouedec can you please give a third opinion on this PR?

@qgallouedec
Member

Sorry for the late reply. I'm slowly catching up after my vacation.
I agree with @LeonEricsson: I understand the intuition of not breaking the chat template, but do we have solid results to validate this claim?

a naive truncation will break the template leading to poor training results.

And are there any results to show that this approach gives significantly better results? I've never seen a paper or library using this approach; do you know of any?

@pramodith
Collaborator Author

And are there any results to show that this approach gives significantly better results? I've never seen a paper or library using this approach; do you know of any?

That's a fair question; it's perhaps necessary to quantify the impact of this. I can create a Colab notebook to test the results of a model trained on a dataset where 20% of the prompts are truncated. We can then compare the results on a test set when preserving and not preserving the chat template.

Are there any datasets that you'd recommend for this, and what model should I be training? Is Qwen3-0.6B good enough, or do we want to train a larger model?

@LeonEricsson
Collaborator

We're left truncating, so cutting the system prompt first and foremost. We should see this impact training, given the importance of the system prompt in common GRPO applications.

Are there any datasets that you'd recommend for this, and what model should I be training? Is Qwen3-0.6B good enough, or do we want to train a larger model?

How about open-r1/DAPO-Math-17k-Processed with Qwen3 1.7B?

@pramodith pramodith closed this Aug 6, 2025