Delta Lake tables are very slow with DuckDB, faster with DataFusion, break with Polars for 1 billion rows #6771
lostmygithubaccount started this conversation in General
I wanted to generate 1 billion rows of data and do some comparisons between backends. I ended up with a script like this to generate the data:
resulting in a decent amount of data (larger than RAM):
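The actual on-disk figures were lost in the scrape. A stdlib-only helper like the following can total a dataset's size on disk (the directory layout is a placeholder assumption, e.g. a partitioned Parquet dataset or a Delta table directory):

```python
# Sum the on-disk size of every file under a directory tree, in GB.
from pathlib import Path


def dir_size_gb(root: str) -> float:
    return sum(f.stat().st_size for f in Path(root).rglob("*") if f.is_file()) / 1e9
```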
Interesting behavior observed between DuckDB, DataFusion, and Polars, and between Parquet and Delta Lake. To summarize:
- DuckDB: fast for `read_parquet` and very slow for `read_delta`
- DataFusion: fast for `read_parquet` and much faster for `read_delta` (goofy timing in the code below was due to issues with `%time`)
- Polars: both fail after tens of seconds

DuckDB:

DataFusion:

Polars (both fail after tens of seconds):
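Since `%time` was flaky here, one way to make the timings comparable across backends is an explicit wall-clock harness; the reader calls in the usage comments are placeholders for whatever `read_parquet`/`read_delta` API each backend exposes, not code from the original post:

```python
import time


def time_call(label, fn, *args, **kwargs):
    """Run fn once, print its wall-clock time, and return (seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return elapsed, result


# Usage sketch (backend and path names are placeholders):
# time_call("read_parquet", some_backend.read_parquet, "data.parquet")
# time_call("read_delta", some_backend.read_delta, "data_delta/")
```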
I'm not really sure what to make of this, but figured I'd document it here.