⚡ Pack 300 times faster, truncate 100 times faster #3009

Merged: 11 commits merged into huggingface:main from fast-pack-truncate on Mar 22, 2025

Conversation

@mariosasko (Contributor) commented Mar 4, 2025

What does this PR do?

Adds fast packing/truncation logic that operates directly on PyArrow arrays to avoid expensive Python-to-PyArrow and PyArrow-to-Python conversions. This makes these steps almost instantaneous regardless of the input dataset's size.
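
For intuition, here is a minimal illustration (not code from this PR) of why working directly on the Arrow representation is cheap: a tokenized list column is stored as one flat values buffer plus an offsets array, so packing and truncation reduce to rewriting offsets instead of materializing Python lists.

import pyarrow as pa

# A list column is a flat values buffer plus an offsets array that marks
# where each example starts and ends.
column = pa.array([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(column.values)   # flat buffer: [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(column.offsets)  # example boundaries: [0, 3, 5, 9]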

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@lhoestq (Member) left a comment

Hi Mario! :) Just added a few comments, but overall LGTM.

"""
if map_kwargs is None:
map_kwargs = {}
if isinstance(dataset, Dataset):

Member:

I guess it should also work for DatasetDict?

Member:

No need to make it work with DatasetDict here; we apply potentially different preprocessing depending on the split. See:

preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
if preprocess_dataset:
    train_dataset = self._prepare_dataset(
        train_dataset, processing_class, args, args.packing, formatting_func, "train"
    )
    if eval_dataset is not None:
        packing = args.packing if args.eval_packing is None else args.eval_packing
        if isinstance(eval_dataset, dict):
            eval_dataset = {
                key: self._prepare_dataset(dataset, processing_class, args, packing, formatting_func, key)
                for key, dataset in eval_dataset.items()
            }
        else:
            eval_dataset = self._prepare_dataset(
                eval_dataset, processing_class, args, packing, formatting_func, "eval"
            )

Comment on lines 511 to 519
if pyarrow.types.is_list(column.type) or pyarrow.types.is_large_list(column.type):
    if isinstance(column, pa.ChunkedArray):
        column = column.combine_chunks()
    num_elements = len(column.values)
    dtype = column.offsets.type.to_pandas_dtype()  # np.int32 or np.int64
    offsets = np.arange(0, num_elements + 1, seq_length, dtype=dtype)
    if offsets[-1] != num_elements:
        offsets = np.concatenate([offsets, [num_elements]])
    column = type(column).from_arrays(offsets, column.values)

Member:

Is there a pyarrow.compute function you can use here instead to simplify this?

The functions in pyarrow.compute generally copy the data, but since .combine_chunks() also copies the data, it might not affect performance much.

@mariosasko (Contributor, Author) commented Mar 16, 2025

The compute API is very limited, so I think this is the best we can do :)

The only way to avoid the copy would be by packing the individual .chunks (without crossing the boundaries), but this would then make the operation dependent on the underlying chunking and lead to more padding afterward (when collating the batches), so I think the copy is justified here.

Member:

Too bad pyarrow.compute is missing something like that... anyway, let's go with this approach then :) My only remaining question is whether this approach works with sliced arrays?

@mariosasko (Contributor, Author):

Made the change to handle that, but I don't remember in what scenario we receive sliced arrays as input, so I didn't add a test case.

Member:

LGTM! I think it can happen when training on a small portion of the train set, like --dataset_train_split "train[:100]".
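
For illustration, a small hedged sketch (toy data, not a test from this PR) of what a sliced array looks like at the Arrow level; the exact behavior of .values on sliced arrays is a pyarrow subtlety, but the gist is that the slice is not applied to the underlying buffer, so naive offset math would be wrong.

import pyarrow as pa

# A sliced list array keeps the full underlying values buffer; its offsets
# describe only a window of it, so len(column.values) no longer matches the data.
arr = pa.array([[1, 2], [3, 4, 5], [6]])
sliced = arr.slice(1)       # drop the first example
print(sliced.to_pylist())   # [[3, 4, 5], [6]]
print(len(sliced.values))   # 6: the full buffer, including the dropped [1, 2]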

return dataset


def truncate_dataset(

Member:

Same comments for truncate.

@mariosasko (Contributor, Author):
pyarrow.compute.list_slice reuses the .values buffers and only modifies the .offsets, so no unnecessary copies here.
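
For reference, a minimal hedged example of that call on toy data (not the PR's actual code path), showing the truncation semantics:

import pyarrow as pa
import pyarrow.compute as pc

# Truncate every sequence to its first 4 tokens; per the comment above,
# list_slice rewrites the offsets and reuses the existing values buffer.
column = pa.chunked_array([[[1, 2, 3, 4, 5, 6], [7, 8]]])
truncated = pc.list_slice(column, 0, 4)
print(truncated.to_pylist())  # [[1, 2, 3, 4], [7, 8]]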

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) left a comment

LGTM!

@qgallouedec (Member) commented Mar 22, 2025

Benchmark packing

import timeit
import numpy as np
from datasets import Dataset
from trl.data_utils import pack_examples, pack_dataset

# Create a larger dataset with sequence lengths following a gamma distribution
num_samples = 10_000

# Generate sequence lengths following a gamma distribution
seq_lengths = np.random.gamma(shape=5, scale=20, size=num_samples) # mean will be 100
seq_lengths = np.clip(seq_lengths, 10, None).astype(int)  # Clip to [10, inf)

# Generate input sequences with random lengths based on gamma distribution
examples = {
    "input_ids": [list(range(length)) for length in seq_lengths],
    "attention_mask": [[1] * length for length in seq_lengths],
}

dataset = Dataset.from_dict(examples)
max_length = 128  # Set a fixed packing length

# Benchmark pack_dataset
time_pack_dataset = timeit.timeit(lambda: pack_dataset(dataset, max_length), number=10)

# Benchmark dataset.map with pack_examples
time_pack_examples = timeit.timeit(
    lambda: dataset.map(pack_examples, batched=True, fn_kwargs={"seq_length": max_length}), number=10
)

print(f"pack_dataset time: {time_pack_dataset:.4f} seconds")
print(f"dataset.map(pack_examples) time: {time_pack_examples:.4f} seconds")
pack_dataset time: 0.0667 seconds
dataset.map(pack_examples) time: 19.3734 seconds
Speedup: 290.46x

@qgallouedec (Member) commented Mar 22, 2025

Benchmark truncate

import timeit
import numpy as np
from datasets import Dataset
from trl.data_utils import truncate_dataset


def truncate_examples(example, max_length):
    return {key: example[key][:max_length] for key in ["input_ids", "attention_mask"]}


# Create a larger dataset with sequence lengths following a gamma distribution
num_samples = 10_000

# Generate sequence lengths following a gamma distribution
seq_lengths = np.random.gamma(shape=5, scale=20, size=num_samples)  # mean will be 100
seq_lengths = np.clip(seq_lengths, 10, None).astype(int)  # Clip to [10, inf)

# Generate input sequences with random lengths based on gamma distribution
examples = {
    "input_ids": [list(range(length)) for length in seq_lengths],
    "attention_mask": [[1] * length for length in seq_lengths],
}

dataset = Dataset.from_dict(examples)
max_length = 128  # Set a fixed truncation length

# Benchmark truncate_dataset
time_truncate_dataset = timeit.timeit(lambda: truncate_dataset(dataset, max_length), number=10)

# Benchmark dataset.map with truncate_examples
time_truncate_examples = timeit.timeit(
    lambda: dataset.map(truncate_examples, batched=True, fn_kwargs={"max_length": max_length}), number=10
)

print(f"truncate_dataset time: {time_truncate_dataset:.4f} seconds")
print(f"dataset.map(truncate_examples) time: {time_truncate_examples:.4f} seconds")
print(f"Speedup: {time_truncate_examples / time_truncate_dataset:.2f}x")
truncate_dataset time: 0.0611 seconds
dataset.map(truncate_examples) time: 6.3807 seconds
Speedup: 104.47x

@qgallouedec changed the title from "Fast packing and truncation" to "⚡ Pack 300 times faster, truncate 100 times faster" on Mar 22, 2025

@qgallouedec (Member) left a comment

Thanks @mariosasko @lhoestq 🔥🔥

@qgallouedec merged commit 7511aa4 into huggingface:main on Mar 22, 2025
7 of 13 checks passed
@mariosasko deleted the fast-pack-truncate branch on March 25, 2025
kashif pushed a commit to kashif/trl that referenced this pull request Mar 28, 2025
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025