Comparison of reading time in pySpark #27
Replies: 8 comments
-
Comparison of reading files one by one vs. all together in pySpark. We can see that reading them all together is much faster.
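For reference, a minimal sketch of the two read patterns being compared (the file paths and options are hypothetical placeholders, not the exact code behind the plot):

```python
import functools
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-comparison").getOrCreate()

# Hypothetical list of CSV files to read
paths = ["data/file_01.csv", "data/file_02.csv", "data/file_03.csv"]

# One by one: read each file into its own DataFrame, then union the results
dfs = [spark.read.csv(p, header=True) for p in paths]
df_one_by_one = functools.reduce(lambda a, b: a.unionByName(b), dfs)

# All together: pass the whole list of paths to a single read call,
# so Spark plans one job over all the files
df_all_together = spark.read.csv(paths, header=True)
```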
-
Comparison of reading files with vs. without schema in pySpark. We can see that reading them with schema is much faster.
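As a rough illustration, the two variants might look something like this (the schema fields are hypothetical, not the columns actually used):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-comparison").getOrCreate()

# Without an explicit schema: inferSchema makes Spark scan the data first
# to work out the column types before the actual read
df_inferred = spark.read.csv("data/file_01.csv", header=True, inferSchema=True)

# With an explicit schema: the types are known up front,
# so no inference pass over the data is needed
schema = StructType([
    StructField("subject_id", StringType(), True),  # hypothetical column
    StructField("value", DoubleType(), True),       # hypothetical column
])
df_with_schema = spark.read.csv("data/file_01.csv", header=True, schema=schema)
```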
-
Very nice work @thepushkarp
I presume this is with schema, e.g. 2615 KB in ~15 sec? Though it seems faster than the equivalent 2615 KB read in the (schema vs. no-schema) plot, which takes ~20 sec.
-
Yes, it was done with schema. In fact, the one with schema and the one where all files are read together use the same lines of code.
Thanks for pointing it out. I computed this locally on my PC, and the plot shows the mean of 10 runs, with the variance as the black bar. I guess this difference might be due to some background process, as the plot for the 2615 KB read has a larger variance compared to the other plots. Still, I will try to run it again in an isolated environment such as Google Colab and report the findings.
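A rough sketch of how such a benchmark loop could look (`read_fn` is a placeholder for whichever read variant is being timed; this is not the exact notebook code):

```python
import time
import statistics

def benchmark(read_fn, n_runs=10):
    """Time read_fn over n_runs and return (mean, variance) in seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        read_fn().count()  # force evaluation, since Spark reads lazily
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.variance(times)
```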
-
Updated comparison graphs run on Google Colab. The times with schema and with all files read together are now similar in both graphs. I think some background processes on my local machine must have been the culprit behind the discrepancy! Link to Colab Notebook: https://colab.research.google.com/drive/19gXQfZ2feazau4rXlG17qbDAoj_6csD_?usp=sharing
-
Would be interesting to see whether the speedup stays constant as you scale up. Naively I might expect an increase in efficiency, since you presumably need to read the data completely before inferring the schema in the 'no-schema' read. But eyeballing the plots, maybe this isn't the case?
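One way that scaling experiment could be set up, reusing the `benchmark` helper and the explicit `schema` from the sketches above (the size buckets and file globs are hypothetical):

```python
# Compare inferred vs. explicit-schema reads at increasing data sizes
size_buckets = {
    "small": "data/small_*.csv",
    "medium": "data/medium_*.csv",
    "large": "data/large_*.csv",
}

for label, glob in size_buckets.items():
    mean_inferred, _ = benchmark(lambda: spark.read.csv(glob, header=True, inferSchema=True))
    mean_explicit, _ = benchmark(lambda: spark.read.csv(glob, header=True, schema=schema))
    print(f"{label}: inferred={mean_inferred:.2f}s, "
          f"explicit={mean_explicit:.2f}s, "
          f"speedup={mean_inferred / mean_explicit:.2f}x")
```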
-
Yeah, it would be a good experiment to see how much the reading time changes as you scale up. I think we could use study data to do it once the i/o module is done.
-
Sure, it would be interesting to run this analysis on data at the scale of real-world datasets.
Cool!
-
Make a detailed comparison between the following to make an informed decision about the choice of DataFrame to use: