🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 #3568

qgallouedec · 2025-06-11T15:24:06Z

What does this PR do?

This PR fixes two issues that show up in distributed settings:

A breaking change in Transformers: seed_worker now requires two new arguments, num_workers and rank. See "update seed_worker to set seed based on worker_id and rank", transformers#37980.
A pickling error caused by the collator being defined inside GRPOTrainer. See "Data collator not found during pickling with trl 0.18.1 and pytorch 2.7", #3567.
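
To illustrate the first point, here is a minimal sketch of binding the two extra arguments with functools.partial, so that the result can still be called with a single worker_id, which is how a PyTorch DataLoader invokes its worker_init_fn. The seed_worker below is a hypothetical stand-in with the same three-argument shape, not the Transformers implementation; make_worker_init_fn and the seed formula are assumptions made purely for illustration.

```python
import random
from functools import partial


def seed_worker(worker_id, num_workers, rank):
    """Hypothetical stand-in for a three-argument seeding function.

    The formula below is an assumption for illustration only; it simply
    gives every (rank, worker_id) pair a distinct seed.
    """
    seed = rank * num_workers + worker_id
    random.seed(seed)
    return seed


def make_worker_init_fn(num_workers, rank):
    """Bind num_workers and rank so only worker_id remains to be supplied."""
    return partial(seed_worker, num_workers=num_workers, rank=rank)


# The bound function takes a single positional worker_id.
init_fn = make_worker_init_fn(num_workers=4, rank=1)
seeds = [init_fn(worker_id) for worker_id in range(4)]
```

In a real trainer, the bound function would be passed as worker_init_fn when building the DataLoader, with num_workers and rank taken from the training arguments and the process index.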

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-06-11T15:27:56Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Tavish9

With the spawn start method (the default on Windows and macOS), all objects (including functions) passed between processes must be serialized (pickled).

Pickle serializes functions by reference (module and qualified name), so a function can only be pickled if it is defined at the top level of a module. Functions defined inside another function or method cannot be pickled.

So we need to move the local function to module level to support the multiprocessing DataLoader under the spawn context.
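
A minimal sketch of the difference, using only the standard library. The names identity and data_collator mirror the ones discussed in this PR, while make_local_collator and can_pickle are hypothetical helpers introduced here for illustration.

```python
import pickle


def identity(features):
    """Module-level function: pickle can store it by reference."""
    return features


def make_local_collator():
    """Return a function defined inside another function (a local object)."""

    def data_collator(features):
        return features

    return data_collator


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        # Depending on the Python version, a local function raises one of these.
        return False
    return True


if __name__ == "__main__":
    print("module-level identity picklable:", can_pickle(identity))
    print("local data_collator picklable:", can_pickle(make_local_collator()))
```

When run as a script, the module-level function should pickle and the local one should not. That is the situation a DataLoader hits when it has to send the collator to worker processes started with spawn.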

trl/trainer/grpo_trainer.py

Tavish9 · 2025-06-12T12:23:38Z

Please also link to:

ORPOTrainer crashes due to pickling failure if dataloader_num_workers > 0 #2779
AttributeError: Can't pickle local object 'GRPOTrainer.__init__.<locals>.data_collator' #2979

These are the same problem as:

Data collator not found during pickling with trl 0.18.1 and pytorch 2.7 #3567

qgallouedec · 2025-06-12T14:29:10Z

Thanks for the pointers @Tavish9! Added to the resolved issues.

shirinyamani · 2025-06-12T15:32:56Z

trl/trainer/grpo_trainer.py

@@ -275,6 +278,11 @@ def nanmax(tensor: torch.Tensor) -> torch.Tensor:
    return torch.max(tensor[~torch.isnan(tensor)])


+def identity(x):
+    """Do we really need docs for this?"""


Suggested change

-    """Do we really need docs for this?"""
+    """GRPO does not need data_collator, to avoid crash, this simple function will be used as data_collator when initializing the GRPOTrainer"""

We already document it below, so I don't think it's necessary.

shirinyamani · 2025-06-12T15:38:40Z

The identity function looks like what we previously had as the data_collator function, so do you think the name identity is clearer?

def data_collator(features):  # No data collation is needed in GRPO
    return features

Also, now that this is defined at global scope, I'm curious to see whether it solves issue #3567.

qgallouedec · 2025-06-12T15:47:35Z

I prefer to keep the name identity because it reflects what the function does regardless of where it's used (even if we only use it as data collator). Naming based on behavior rather than context keeps the function general.

I'm curious to see does this solve issue

@Tavish9 gave some clues on this. Basically, we need the class to be picklable, which is not the case when functions are defined inside methods.

Tavish9 · 2025-06-13T01:57:06Z

@qgallouedec Since this PR fixes two things, the pickling error and the transformers update, it should also be linked to the following issue:

GRPOTrainer crashs with dataloader_num_workers >= 1 #3544

🏗️ Add test for training with multiple dataloader workers and update …

d9719b9

…worker initialization for compatibility with transformers 4.52.0

style

d295e22

This was referenced Jun 11, 2025

Data collator not found during pickling with trl 0.18.1 and pytorch 2.7 #3567

Closed

Update PULL_REQUEST_TEMPLATE.md huggingface/transformers#38770

Merged

Tavish9 suggested changes Jun 12, 2025

qgallouedec and others added 4 commits June 12, 2025 07:39

Move colator outside GRPO

155fdc5

fix

9006cdd

Merge branch 'main' into fix-ddp

477af30

docstring

7b52213

qgallouedec requested review from kashif, edbeeching, lewtun and shirinyamani June 12, 2025 08:58

Merge branch 'main' into fix-ddp

2d4d319

This was linked to issues Jun 12, 2025

Data collator not found during pickling with trl 0.18.1 and pytorch 2.7 #3567

Closed

AttributeError: Can't pickle local object 'GRPOTrainer.__init__.<locals>.data_collator' #2979

Closed

qgallouedec mentioned this pull request Jun 12, 2025

🔧 Use partial for worker_init_fn in GRPO and OnlineDPO trainers to include num_workers and rank, fixes bug https://github.com/huggingface/trl/issues/3544 #3547

Closed

5 tasks

shirinyamani approved these changes Jun 12, 2025

qgallouedec merged commit 15ff547 into main Jun 12, 2025
19 of 20 checks passed

qgallouedec deleted the fix-ddp branch June 12, 2025 17:13

qgallouedec added a commit that referenced this pull request Jun 15, 2025

🏗️ Add test for training with multiple dataloader workers and update …

1a67176

…worker initialization for compatibility with transformers 4.52.0 (#3568)
