📉 FFD packing #3521
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Great work!
Awesome work!
I believe there are two things that still need to be implemented in future PRs to make packing comparable to non-packing:
- Correct sequence IDs: currently the sequence IDs are created on the fly in the modelling code, so we do not account for the actual positions of the packed sequences. These could be precomputed and passed to the model.
- 4D attention masks: many transformers implementations now support 4D attention masks in their model signature. This would mitigate potential issues from cross-attention between packed sequences.
To provide more details on the attention masks: flash attention, for example, requires cumulative_seqlens, which correspond to the start and end indices of each sequence in the (unflattened) batch; these can be passed as flash_attn_kwargs.
Links to examples in the Qwen modelling code, for reference:
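Since those links aren't reproduced here, below is a rough, hypothetical sketch (not the Qwen or TRL code) of how position IDs and cumulative sequence lengths could be precomputed from the lengths of the sequences packed into one row:

```python
# Illustrative sketch only (not the Qwen/TRL implementation): given the lengths
# of the sequences packed into one row, precompute position_ids that restart at
# 0 for each sequence, and the cumulative sequence lengths used by varlen
# flash-attention kernels to prevent cross-attention between packed sequences.
import torch

def precompute_packing_metadata(seq_lens):
    # e.g. seq_lens = [3, 2] -> position_ids = [0, 1, 2, 0, 1]
    position_ids = torch.cat([torch.arange(l) for l in seq_lens])
    # e.g. seq_lens = [3, 2] -> cu_seqlens = [0, 3, 5] (sequence boundaries)
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens), dim=0)
    return position_ids, cu_seqlens

position_ids, cu_seqlens = precompute_packing_metadata([3, 2])
print(position_ids)  # tensor([0, 1, 2, 0, 1])
print(cu_seqlens)    # tensor([0, 3, 5], dtype=torch.int32)
```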
I could take a look at passing kwargs to the flash attn kernel + making packing more efficient today! Great PR, long awaited.
This PR introduces a new packing strategy, FFD (First Fit Decreasing).
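For readers not familiar with it, first fit decreasing is a classic bin-packing heuristic: sort the sequences by decreasing length, then place each one into the first bin (packed row) with enough remaining capacity. A minimal, self-contained sketch over sequence lengths (illustrative only, not the exact implementation in this PR):

```python
# Minimal first-fit-decreasing (FFD) sketch over sequence lengths.
# Illustrative only; not the exact implementation in this PR.

def ffd_pack(seq_lens, max_length):
    """Group sequence lengths into bins of capacity `max_length` using FFD."""
    bins = []          # remaining capacity of each bin
    assignments = []   # indices of the sequences placed in each bin
    # Visit sequences by decreasing length ("Decreasing" in FFD).
    for idx in sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True):
        length = seq_lens[idx]
        if length > max_length:
            continue  # cannot fit in any bin: such sequences are discarded
        # Place the sequence in the first bin where it fits ("First Fit").
        for b, capacity in enumerate(bins):
            if length <= capacity:
                bins[b] -= length
                assignments[b].append(idx)
                break
        else:
            # No existing bin can hold it: open a new bin.
            bins.append(max_length - length)
            assignments.append([idx])
    return assignments

# Example: packing lengths [7, 2, 5, 4, 3] into bins of capacity 8
print(ffd_pack([7, 2, 5, 4, 3], max_length=8))  # [[0], [2, 4], [3, 1]]
```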
Advantages:
Drawbacks:
- Sequences longer than `max_length` are discarded
Benchmark
Speed
Time to pack a dataset containing 100k rows (hardly correlated to `max_length`).
So it's way slower (~30 times) but still very reasonable.
Code used
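The snippet behind "Code used" is collapsed in the original PR and not reproduced here. As a stand-in, a timing harness along these lines (reusing the toy `ffd_pack` sketch above, with random lengths instead of a real 100k-row dataset) gives an idea of how such a measurement could be made:

```python
# Hypothetical timing harness (not the author's benchmark code).
# Uses the toy ffd_pack sketch from above and random lengths instead of a real
# dataset; the row count is scaled down from 100k so the pure-Python toy
# implementation finishes quickly.
import random
import time

random.seed(0)
seq_lens = [random.randint(1, 1024) for _ in range(10_000)]

start = time.perf_counter()
bins = ffd_pack(seq_lens, max_length=2048)
elapsed = time.perf_counter() - start
print(f"packed {len(seq_lens)} rows into {len(bins)} bins in {elapsed:.2f}s")
```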
Padding tokens efficiency
I compared the number of padding tokens (the fewer the better) that we ended up with for different datasets and different sequence lengths.
Code used
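Again, the collapsed snippet isn't reproduced here; a hypothetical way to count padding tokens for a given packing (assuming each packed row is padded up to `max_length`, and reusing the toy `ffd_pack` output from earlier):

```python
# Hypothetical sketch (not the author's benchmark code): count padding tokens
# for a packing, assuming every packed row is padded up to max_length.

def count_padding_tokens(bins, seq_lens, max_length):
    """`bins` maps each packed row to the indices of the sequences it holds."""
    return sum(max_length - sum(seq_lens[i] for i in bin_indices) for bin_indices in bins)

# Example, using the toy ffd_pack output from earlier:
seq_lens = [7, 2, 5, 4, 3]
bins = [[0], [2, 4], [3, 1]]  # ffd_pack(seq_lens, max_length=8)
print(count_padding_tokens(bins, seq_lens, max_length=8))  # 3 padding tokens
```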