
Conversation

@thomwolf (Member) commented Aug 31, 2020

Test whether we can get better performance for large-scale text datasets by using multi-threaded text file loading based on Apache Arrow's multi-threaded CSV loader.

If it works well, it should fix #546.

Breaking change:
Text lines no longer include their final line breaks.
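
For reference, here is a minimal sketch of how PyArrow's multi-threaded CSV reader can be repurposed to read plain text line by line. The column name, delimiter choice, and file name are illustrative assumptions, not necessarily what this PR uses:

```python
import pyarrow.csv as pac

# Sketch: read a plain-text file as a single-column Arrow table.
# A delimiter that never occurs in the data makes each line parse as
# one field; quoting is disabled so quote characters are kept verbatim.
read_options = pac.ReadOptions(use_threads=True, column_names=["text"])
parse_options = pac.ParseOptions(delimiter="\x01", quote_char=False)

table = pac.read_csv(
    "train.txt",  # illustrative file name
    read_options=read_options,
    parse_options=parse_options,
)

# Per the breaking change above, the reader strips trailing line breaks,
# so each row holds the line without its "\n".
print(table.column("text")[0])
```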

@lhoestq (Member) commented Aug 31, 2020

Awesome!
Also, I was wondering if we should try to make the hashing of the data_files faster (it is used to build the cache directory for datasets like text or json). Right now it reads each file and hashes all of its data. Could we simply hash the path and some metadata, including the last-modified time? Apparently we can get that with os.path.getmtime(path).
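
A minimal sketch of that idea, assuming a hypothetical helper name and that path plus size plus mtime is an acceptable stand-in for hashing the full contents:

```python
import hashlib
import os

def fingerprint_data_file(path: str) -> str:
    """Hypothetical helper: fingerprint a data file from its path and
    cheap metadata instead of reading and hashing all of its bytes."""
    payload = "|".join([
        os.path.abspath(path),
        str(os.path.getsize(path)),   # file size in bytes
        str(os.path.getmtime(path)),  # last-modified timestamp
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The fingerprint changes whenever the file is moved, resized, or
# modified, without the cost of reading the whole file.
print(fingerprint_data_file("train.txt"))  # illustrative file name
```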

@lhoestq (Member) commented Sep 4, 2020

I just rebased from master to include the hashing changes from #573.

@thomwolf (Member, Author) commented Sep 4, 2020

I think this is ready to merge, no?

@lhoestq (Member) commented Sep 8, 2020

Indeed it's ready to merge :)

@thomwolf thomwolf changed the title Switch text loading to multi-threaded PyArrow loading [Breaking] Switch text loading to multi-threaded PyArrow loading Sep 8, 2020
@thomwolf (Member, Author) commented Sep 8, 2020

Ok, I added the breaking change info, so we can indeed merge.

@thomwolf thomwolf merged commit 32eea04 into master Sep 8, 2020
@thomwolf thomwolf deleted the multithread-text branch September 8, 2020 10:19

Development

Successfully merging this pull request may close these issues:

Very slow data loading on large dataset
