Skip to content

🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 #3568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 12, 2025

Conversation

qgallouedec
Copy link
Member

@qgallouedec qgallouedec commented Jun 11, 2025

What does this PR do?

This PR fixes two issues related to distributed setting:

Fixes #2779 #2979 #3567

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

…worker initialization for compatibility with transformers 4.52.0
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@Tavish9 Tavish9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spawn method (the default on Windows and macOS), method requires that all objects (including functions) passed between processes need to be serialized (pickled).

Pickle only serializes objects that are defined at the top-level of a module.

So we need to move local function outwards to support MultiProcessingDataLoader under spawn context

@Tavish9
Copy link
Contributor

Tavish9 commented Jun 12, 2025

@qgallouedec
Copy link
Member Author

Thanks for the pointers @Tavish9! Added in solved issues

@@ -275,6 +278,11 @@ def nanmax(tensor: torch.Tensor) -> torch.Tensor:
return torch.max(tensor[~torch.isnan(tensor)])


def identity(x):
"""Do we really need docs for this?"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Do we really need docs for this?"""
"""GRPO does not need data_collator, to avoid crash, this simple function will be used as data_collator when initializing the GRPOTrainer"""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already document it below, I think it's not necessary

@shirinyamani
Copy link
Member

shirinyamani commented Jun 12, 2025

The identity method looks like what we had as data_collator method, so do you think identity name is more clear?

def data_collator(features):  # No data collation is needed in GRPO
    return features

also now that this is defines in global scope, I'm curios to see does this solve issue #3567

@qgallouedec
Copy link
Member Author

I prefer to keep the name identity because it reflects what the function does regardless of where it's used (even if we only use it as data collator). Naming based on behavior rather than context keeps the function general.

I'm curios to see does this solve issue

@Tavish9 gave some clue on this. Basically we need the class to be pickable, which is not the case when function are defined within methods

@qgallouedec qgallouedec merged commit 15ff547 into main Jun 12, 2025
19 of 20 checks passed
@qgallouedec qgallouedec deleted the fix-ddp branch June 12, 2025 17:13
@Tavish9
Copy link
Contributor

Tavish9 commented Jun 13, 2025

@qgallouedec Since this pr fixed two things: pickle and transformers update, it should also be linked to the following issues:

qgallouedec added a commit that referenced this pull request Jun 15, 2025
…worker initialization for compatibility with transformers 4.52.0 (#3568)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants