📦 Packing with flash attn kwargs to avoid cross-contamination #3526
Conversation
Super cool! Can you change the base branch? I think it doesn't require a transformers change if we play with the signature columns, let me check.
I missed the signature columns, so now it should work :) Updated the plots; the difference is now visible.
@qgallouedec this needs to be re-opened against the main branch now.
Wait, it wasn't automatically rebased? That's what usually happens. 🤨
trl/data_utils.py
Outdated
```diff
@@ -478,6 +479,67 @@ def pack_examples(examples: dict[str, list[list]], seq_length: int) -> dict[str,
     return examples


@numba.njit(["(int32[:], int32)", "(int64[:], int64)"], cache=True)
```
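For readers unfamiliar with the strategy this helper supports, below is a minimal, illustrative sketch of first-fit-decreasing (FFD) packing in plain NumPy. The function name and signature are made up for illustration; this is not the jitted helper added in `trl/data_utils.py`.

```python
import numpy as np

def ffd_pack(lengths: np.ndarray, seq_length: int) -> list[list[int]]:
    """Toy first-fit-decreasing packing: group sequence indices into bins of capacity seq_length."""
    order = np.argsort(lengths)[::-1]   # longest sequences first
    bins: list[list[int]] = []          # indices of the sequences placed in each packed sample
    space: list[int] = []               # remaining capacity of each packed sample
    for idx in order:
        length = int(lengths[idx])
        for b, free in enumerate(space):
            if length <= free:          # first bin with enough room wins
                bins[b].append(int(idx))
                space[b] -= length
                break
        else:                           # no bin fits: open a new one
            bins.append([int(idx)])
            space.append(seq_length - length)
    return bins

print(ffd_pack(np.array([5, 3, 7, 2, 6]), seq_length=8))  # e.g. [[2], [4, 3], [0, 1]]
```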
Do you mind reverting this change, so that we keep the PRs separate?
Actually, that might not be that easy; your two PRs are quite intertwined, right?
Yep, but I will think about reverting. Sorry that I haven't done that before; it should be possible.
done!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
trl/data_utils.py
Outdated
```python
            packed_columns.append(column)
            packed_columns.append(sequence_lengths)
        else:
            packed_columns.append(column)
```
This should work as well?
Suggested change:
```diff
-            packed_columns.append(column)
-            packed_columns.append(sequence_lengths)
-        else:
-            packed_columns.append(column)
+            packed_columns.append(sequence_lengths)
+            packed_columns.append(column)
```
Actually, it seems like `sequence_lengths` can be appended more than once if you have more than one column that matches `pyarrow.types.is_list(column.type) or pyarrow.types.is_large_list(column.type)` (which is almost always the case).
Ah yeah, that makes sense. I also think the ordering matters, right? We could just make sure we add it once, at the end, I guess?
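As a rough sketch of the "add it once, at the end" idea (illustrative only; the function and the `seq_lengths` column name are hypothetical, not the PR's actual code):

```python
import pyarrow as pa

def with_seq_lengths(table: pa.Table, sequence_lengths: pa.Array) -> pa.Table:
    # Collect the packed columns inside the per-column loop...
    packed_names, packed_columns = [], []
    for name, column in zip(table.column_names, table.columns):
        packed_names.append(name)
        packed_columns.append(column)
    # ...but append the lengths column exactly once, after the loop, so having several
    # list-typed columns (input_ids, attention_mask, labels, ...) cannot duplicate it.
    packed_names.append("seq_lengths")
    packed_columns.append(sequence_lengths)
    return pa.table(dict(zip(packed_names, packed_columns)))
```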
updated
trl/trainer/sft_trainer.py
Outdated
```python
# For the packing case with FFD, we need to store sequence_length returned by the data collator with flattening
if self.args.packing and self.args.packing_strategy == "ffd" and self.args.padding_free:
    self._signature_columns.append("sequence_length")
```
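For context, a self-contained illustration of why the append is needed, assuming the usual transformers `Trainer` behavior of dropping dataset columns whose names are not in `self._signature_columns` (which are derived from the model's forward signature). The helper and `fake_forward` below are hypothetical:

```python
import inspect

def kept_columns(forward_fn, extra_columns=()):
    # Columns the trainer keeps: the forward signature's parameter names,
    # plus anything explicitly whitelisted (e.g. "sequence_length" for FFD packing).
    cols = list(inspect.signature(forward_fn).parameters)
    cols += [c for c in extra_columns if c not in cols]
    return cols

def fake_forward(input_ids=None, attention_mask=None, position_ids=None, labels=None):
    pass

print(kept_columns(fake_forward))                                     # "sequence_length" would be dropped
print(kept_columns(fake_forward, extra_columns=["sequence_length"]))  # kept after appending it
```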
nice!
It seems like it doesn't work, I'm not sure why exactly. It seems easier to directly build…
trl/trainer/sft_trainer.py
Outdated
```diff
@@ -293,7 +293,7 @@ def __init__(
 if args.padding_free:
     if data_collator is not None:
         raise ValueError("Passing a custom data collator is not supported when using padding-free.")
-    if args.packing:
+    if args.packing and args.packing_strategy != "ffd":
```
Suggested change:
```diff
-    if args.packing and args.packing_strategy != "ffd":
+    if args.packing:
```
Actually, padding_free is a different method (which is less relevant once we have FFD with flash-attn).
You mean that if we use packing with flash attention, we are inherently already using padding-free, since we pass position_ids?
We could probably change the warning to say something like "when FFD packing is enabled, the model accepts position_ids, which works the same way as padding-free, so specifying padding_free has no effect".
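To make the position_ids point concrete, here is a small sketch of how a varlen flash-attention path can recover sequence boundaries from position_ids alone (an illustration of the idea, not transformers' actual implementation):

```python
import torch
import torch.nn.functional as F

# A flattened batch of three packed sequences of lengths 3, 2 and 4:
position_ids = torch.tensor([[0, 1, 2, 0, 1, 0, 1, 2, 3]])

# Every reset to 0 marks the start of a new sequence, so cumulative sequence
# lengths (cu_seqlens) can be rebuilt and no attention crosses a boundary,
# even though everything lives in a single row -- i.e. padding-free by construction.
starts = torch.nonzero(position_ids[0] == 0).flatten()
ends = torch.tensor([position_ids.shape[1]])
lengths = torch.diff(torch.cat([starts, ends]))
cu_seqlens = F.pad(torch.cumsum(lengths, dim=0), (1, 0))
print(cu_seqlens)  # tensor([0, 3, 5, 9])
```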
trl/trainer/sft_trainer.py
Outdated
```python
if args.packing and model.config._attn_implementation != "flash_attention_2":
    warnings.warn(
        "You are using packing with padding-free training, but the attention implementation is not set to "
        "'flash_attention_2'. Packing flattens batches into a single sequence, and 'flash_attention_2' is "
        "the only known attention mechanism that reliably supports this. Using other implementations may "
        "lead to unexpected behavior. To ensure compatibility, set `attn_implementation='flash_attention_2'` "
        "in the model configuration."
    )
data_collator = DataCollatorWithFlattening(
    return_flash_attn_kwargs=False,
    return_position_ids=True,
)
```
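For reference, a hedged usage sketch of `DataCollatorWithFlattening` as configured above; the exact output keys and shapes may vary with the transformers version:

```python
from transformers import DataCollatorWithFlattening

collator = DataCollatorWithFlattening(return_position_ids=True, return_flash_attn_kwargs=False)
features = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},
]
batch = collator(features)
# The examples are concatenated into a single row, and position_ids restart at 0
# at each example boundary -- the signal the flash-attn varlen path needs.
print(batch["input_ids"].shape)  # expected: torch.Size([1, 5])
print(batch["position_ids"])     # expected: tensor([[0, 1, 2, 0, 1]])
```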
I suggest dedenting this
Not sure what would happen if the user does not specify the flash_attention_2 implementation; right now we just won't use padding-free with position_ids.
If we dedent, do we enable DataCollatorWithFlattening in all cases?
Ah, I see the problem. After the recent attention refactoring, all models still accept position_ids as an argument. After investigating, here's the attention flow.
The problem, however, is that for some models attention_mask is always not None. I have a local hack for going into the flash_attn_varlen branch inside…
In addition, adding sequence_length is incorrect after resolving the conflict, @qgallouedec.
Reverting helped.
For reference: thepowerfuldeez#1
False alarm. I just tested training with the aforementioned branch + transformers main, and we are calling flash_attention_varlen using position_ids. Nice and elegant solution: preparing position_ids during packing and flattening with the existing data collator. However, this approach uses more GPU memory, which is surprising.
I'm realising that using position_ids when the attention implementation is not flash-attn hurts the results quite a lot. I'm adding a way to choose whether to use position_ids or not.
Now it looks good! (Purple is the annoying setting where position_ids are passed even when flash-attn is disabled; not the case anymore after 9f4d9ee.)
Thanks a lot @thepowerfuldeez, I'll merge now, but feel free to make additional comments.
Hi @qgallouedec @thepowerfuldeez! I just had a clarification question about how packing is implemented with max_length. If there is a datum that can only partially fit, will it be forced in with truncation, or will it belong to the next sequence? I know the non-packing max_length behavior is truncation. If it doesn't truncate, then how does packing handle whole datums that are longer than the set max_length?
Requires huggingface/transformers#38536
Requires #3521
Modifies packing so that the flash attention kernel is aware of sequence boundaries, which leads to improved sparsity and quality. Works only for SFT.
Loss plots:
(Blue is the new version, brown is the old one)
Achieved by this training config:
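The original config is not reproduced above; as a rough, hedged sketch of the kind of SFT setup this PR targets (model and dataset names are placeholders, parameter names assume a recent TRL version):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="sft-ffd-packing",
    packing=True,                  # pack several examples into each sample
    packing_strategy="ffd",        # first-fit-decreasing packing (the focus of this PR)
    padding_free=True,             # flatten the batch and rely on position_ids
    max_length=2048,
    model_init_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
)

trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B", args=config, train_dataset=dataset)
trainer.train()
```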