a collection of utilities for efficiently working with large-scale Parquet datasets

elphick/parq-tools

parq-tools

Overview

parq-tools is a collection of utilities for efficiently working with large-scale Parquet datasets. A typical use case is asset-based workflows with large scientific datasets.

:::note
If your datasets are not large, you might find the pandas library more convenient.
:::

Features

  • Filtering → Filters large Parquet files efficiently.
  • Concatenation → Combines multiple Parquet files efficiently along rows (axis=0) or columns (axis=1).
  • Tokenized Filtering → Converts pandas-style expressions into efficient PyArrow queries.
  • Profiling Enhancements → Improves ydata-profiling by profiling specific columns incrementally, merging results for large files.
  • DataFrame Enhancements → Provides a LazyParquetDataFrame class that extends pandas.DataFrame with lazy loading from Parquet files.
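
To make the tokenized filtering idea more concrete, here is a minimal sketch of how a pandas-style expression could be turned into the DNF-style filter tuples that PyArrow's Parquet readers accept. This is not the parq-tools implementation: `expression_to_filters` and its helpers are hypothetical names, the column names are made up, and only simple `column <op> literal` comparisons joined by `and` / `or` are handled.

```python
import ast

# Map Python comparison AST nodes to the operator strings used in
# PyArrow's DNF-style `filters` tuples.
_OPS = {
    ast.Eq: "==",
    ast.NotEq: "!=",
    ast.Lt: "<",
    ast.LtE: "<=",
    ast.Gt: ">",
    ast.GtE: ">=",
}


def _comparison(node):
    """Turn one `column <op> literal` node into a (column, op, value) tuple."""
    if not (isinstance(node, ast.Compare) and len(node.ops) == 1):
        raise ValueError("only simple comparisons are supported")
    left, op, right = node.left, node.ops[0], node.comparators[0]
    if not isinstance(left, ast.Name) or type(op) not in _OPS:
        raise ValueError("expected `column <op> literal`")
    return (left.id, _OPS[type(op)], ast.literal_eval(right))


def _conjunction(node):
    """Turn `a and b and ...` (or a single comparison) into a list of tuples."""
    if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
        return [_comparison(value) for value in node.values]
    return [_comparison(node)]


def expression_to_filters(expression):
    """Hypothetical helper: parse a pandas-style expression into DNF filters.

    The result is a list of AND-groups that are OR-ed together.
    """
    node = ast.parse(expression, mode="eval").body
    if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.Or):
        return [_conjunction(value) for value in node.values]
    return [_conjunction(node)]


# Illustrative column names; `and` binds tighter than `or`.
filters = expression_to_filters("depth > 100 and grade >= 0.5 or site == 'A'")
print(filters)
```

The resulting list can be passed as the `filters` argument of `pyarrow.parquet.read_table`, which lets PyArrow skip row groups that cannot match instead of loading the whole file into memory first.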
