
Conversation

@thomwolf (Member) commented Aug 31, 2020

Test whether we can get better performance for large-scale text datasets by using multi-threaded text file loading based on Apache Arrow's multi-threaded CSV loader.

If it works well, it should fix #546.

Breaking change:
Text lines no longer include their final line breaks.
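
For reference, here is a minimal sketch of how PyArrow's multi-threaded CSV reader can be repurposed to read plain text line by line. The column name, delimiter choice, and file name are illustrative assumptions, not necessarily what this PR uses:

```python
import pyarrow.csv as pac

# Sketch: read a plain-text file as a single-column Arrow table.
# A delimiter that never occurs in the data makes each line parse as
# one field; quoting is disabled so quote characters are kept verbatim.
read_options = pac.ReadOptions(use_threads=True, column_names=["text"])
parse_options = pac.ParseOptions(delimiter="\x01", quote_char=False)

table = pac.read_csv(
    "train.txt",  # illustrative file name
    read_options=read_options,
    parse_options=parse_options,
)

# Per the breaking change above, the reader strips trailing line breaks,
# so each row holds the line without its "\n".
print(table.column("text")[0])
```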

@lhoestq (Member) commented Aug 31, 2020

Awesome!
Also, I was wondering if we should try to make the hashing of the data_files faster (it is used to build the cache directory for datasets like text or json). Right now it reads each file and hashes all of its data. Could we simply hash the path and some metadata, including the last-modified time? Apparently we can get that with os.path.getmtime(path).
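
A minimal sketch of that idea, assuming a hypothetical helper name and that path plus size plus mtime is an acceptable stand-in for hashing the full contents:

```python
import hashlib
import os

def fingerprint_data_file(path: str) -> str:
    """Hypothetical helper: fingerprint a data file from its path and
    cheap metadata instead of reading and hashing all of its bytes."""
    payload = "|".join([
        os.path.abspath(path),
        str(os.path.getsize(path)),   # file size in bytes
        str(os.path.getmtime(path)),  # last-modified timestamp
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The fingerprint changes whenever the file is moved, resized, or
# modified, without the cost of reading the whole file.
print(fingerprint_data_file("train.txt"))  # illustrative file name
```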

@lhoestq (Member) commented Sep 4, 2020

I just rebased from master to include the hashing changes from #573.

@thomwolf (Member, Author) commented Sep 4, 2020

I think this is ready to merge, no?

@lhoestq (Member) commented Sep 8, 2020

Indeed it's ready to merge :)

@thomwolf thomwolf changed the title Switch text loading to multi-threaded PyArrow loading [Breaking] Switch text loading to multi-threaded PyArrow loading Sep 8, 2020
@thomwolf (Member, Author) commented Sep 8, 2020

Ok, I added the breaking change info, so we can indeed merge.

@thomwolf thomwolf merged commit 32eea04 into master Sep 8, 2020
@thomwolf thomwolf deleted the multithread-text branch September 8, 2020 10:19

Development

Successfully merging this pull request may close these issues:

Very slow data loading on large dataset
