nlp.load_dataset is not safe for multi processes when loading from local files #543

@luyug

Loading from local files concurrently from multiple processes, e.g., dataset = nlp.load_dataset('csv', data_files=['file_1.csv', 'file_2.csv']), raises FileExistsError at line 430 of the builder: https://github.com/huggingface/nlp/blob/6655008c738cb613c522deb3bd18e35a67b2a7e5/src/nlp/builder.py#L423-L438

This is likely because multiple processes enter download_and_prepare at the same time: https://github.com/huggingface/nlp/blob/6655008c738cb613c522deb3bd18e35a67b2a7e5/src/nlp/load.py#L550-L554
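The suspected race can be illustrated without the library. The sketch below assumes the builder creates a temporary directory with os.makedirs and no exist_ok=True, which I have not confirmed beyond the linked lines. The helper names and the ".incomplete" suffix are placeholders for illustration.

```python
import os


def make_incomplete_dir(cache_dir):
    """Create the temporary build directory the way the builder is assumed to."""
    tmp_dir = cache_dir + ".incomplete"  # placeholder suffix, an assumption
    os.makedirs(tmp_dir)  # no exist_ok=True, so a second caller fails
    return tmp_dir


def simulate_two_workers(root):
    """Replay, in one process, two workers preparing the same dataset."""
    cache_dir = os.path.join(root, "csv-cache")
    make_incomplete_dir(cache_dir)  # worker A starts preparing
    try:
        make_incomplete_dir(cache_dir)  # worker B arrives before A finishes
    except FileExistsError:
        return "FileExistsError"
    return "ok"
```

The two calls run sequentially here only to make the outcome deterministic. With real processes, the same collision depends on timing.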

This can happen when launching distributed training with a command like python -m torch.distributed.launch --nproc_per_node 4 on a new collection of files that has never been loaded before.

I can create a PR that adds file locks. It would help to know the convention for naming and placing the lock file.
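A minimal sketch of the kind of lock I have in mind, using only the standard library (fcntl, so POSIX only). The ".lock" suffix, the helper names cache_lock and prepare_once, and the ".incomplete" temporary directory are placeholders for illustration, not the library's convention.

```python
import contextlib
import fcntl
import os
import shutil


@contextlib.contextmanager
def cache_lock(cache_dir):
    """Hold an exclusive advisory lock on '<cache_dir>.lock' (POSIX only)."""
    lock_path = cache_dir.rstrip(os.sep) + ".lock"  # hypothetical naming
    os.makedirs(os.path.dirname(lock_path) or ".", exist_ok=True)
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the holder releases
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def prepare_once(cache_dir, build):
    """Run build(tmp_dir) only if cache_dir does not exist yet.

    Returns True if this caller built the cache, False if it was already there.
    """
    with cache_lock(cache_dir):
        if os.path.isdir(cache_dir):
            return False  # another process finished while we waited
        tmp_dir = cache_dir + ".incomplete"
        # Safe under the lock: clear leftovers from a crashed earlier attempt.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        os.makedirs(tmp_dir)
        try:
            build(tmp_dir)
            os.rename(tmp_dir, cache_dir)  # publish the finished directory
        finally:
            shutil.rmtree(tmp_dir, ignore_errors=True)  # no-op after the rename
        return True
```

Every process would call prepare_once with the same cache directory. The first one builds the cache, and the others block on the lock, then see the finished directory and skip the work. A cross-platform lock would need something other than fcntl.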
