Closed
Description
Describe the enhancement requested
Both the Rust and Java implementations limit the number of rows written per page:
- Rust: https://github.com/apache/arrow-rs/blob/3126dad0348035bc5fadc8ec61b7150b9ce6aad5/parquet/src/file/properties.rs#L42
- Java: https://github.com/apache/parquet-java/blob/4aa2ea91863274aebb1eded243ce275912c16010/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L61
They apply this row limit in addition to trying to keep the page size under 1 MB. In practice, it keeps the actual page size much smaller than that.
Parquet C++, however, only enforces the 1 MB page size limit and does not limit the number of rows written per page. This can result in much larger pages than other implementations produce.
Large pages can have several problems:
- lower CPU cache efficiency when reading, decompressing, etc.
- less fine-grained page pruning using predicate pushdown
- larger intermediate buffers, leading to a significant increase in memory consumption if there are many columns to read
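
To illustrate the difference between the two flushing policies described above, here is a minimal, self-contained sketch. It is not the Parquet C++ writer API: `PageLimits` and `SimulatePages` are hypothetical names, the encoded size per row is assumed to be fixed, and the 20,000-row cap used in the usage note is only an illustrative value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical limits for illustration only; not the Parquet C++ API.
struct PageLimits {
  int64_t max_bytes;  // target page size in bytes
  int64_t max_rows;   // row-count cap per page; <= 0 means "no cap"
};

// Simulates writing `num_rows` rows whose encoded size is assumed to be a
// fixed `bytes_per_row`, and returns the number of rows placed in each page.
// A page is flushed as soon as either limit is reached.
std::vector<int64_t> SimulatePages(int64_t num_rows, int64_t bytes_per_row,
                                   const PageLimits& limits) {
  assert(bytes_per_row > 0 && limits.max_bytes > 0);
  std::vector<int64_t> pages;
  int64_t rows_in_page = 0;
  int64_t bytes_in_page = 0;
  for (int64_t i = 0; i < num_rows; ++i) {
    ++rows_in_page;
    bytes_in_page += bytes_per_row;
    const bool bytes_full = bytes_in_page >= limits.max_bytes;
    const bool rows_full =
        limits.max_rows > 0 && rows_in_page >= limits.max_rows;
    if (bytes_full || rows_full) {
      pages.push_back(rows_in_page);
      rows_in_page = 0;
      bytes_in_page = 0;
    }
  }
  // Flush the final, partially filled page.
  if (rows_in_page > 0) pages.push_back(rows_in_page);
  return pages;
}
```

With 1,000,000 rows that each encode to a single byte, `SimulatePages(1000000, 1, {1024 * 1024, 0})` never reaches the byte limit and yields one page holding every row, while `SimulatePages(1000000, 1, {1024 * 1024, 20000})` yields 50 pages of 20,000 rows each.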
Component(s)
C++