nlp.load_dataset is not safe for multiple processes when loading from local files #543

Loading from local files, e.g., `dataset = nlp.load_dataset('csv', data_files=['file_1.csv', 'file_2.csv'])`,
concurrently from multiple processes raises `FileExistsError` at line 430 of the builder: https://github.com/huggingface/nlp/blob/6655008c738cb613c522deb3bd18e35a67b2a7e5/src/nlp/builder.py#L423-L438

This is likely because multiple processes step into `download_and_prepare` at the same time: https://github.com/huggingface/nlp/blob/6655008c738cb613c522deb3bd18e35a67b2a7e5/src/nlp/load.py#L550-L554

This can happen when launching distributed training with commands like `python -m torch.distributed.launch --nproc_per_node 4` on a new collection of files never loaded before.

I can create a PR that adds file locks. It would help to know the convention for naming and placing the lock file.
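
A minimal sketch of what such a lock could look like, using only the Python standard library. This is an illustration under assumptions, not the library's code: the helper names `exclusive_lock` and `prepare_once`, the `.lock` suffix placed next to the cache directory, and the `build_fn` callback are all hypothetical, and `fcntl.flock` is POSIX-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (POSIX only, via flock)."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def prepare_once(cache_dir, build_fn):
    """Run build_fn(tmp_dir) at most once per cache_dir across processes.

    Returns True if this call built the cache, False if it already existed.
    """
    cache_dir = os.path.abspath(cache_dir)
    parent = os.path.dirname(cache_dir)
    os.makedirs(parent, exist_ok=True)
    # Hypothetical naming: the lock file sits next to the cache directory.
    with exclusive_lock(cache_dir + ".lock"):
        # Re-check under the lock: another process may have finished already.
        if os.path.isdir(cache_dir):
            return False
        # Build into a temporary sibling directory, then rename into place.
        tmp_dir = tempfile.mkdtemp(dir=parent)
        build_fn(tmp_dir)
        os.rename(tmp_dir, cache_dir)
        return True
```

With this pattern, every process launched by `torch.distributed.launch` would call `prepare_once` with the same cache path; the first one to take the lock builds the cache, and the others block until it is done and then skip the build instead of hitting `FileExistsError`.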
