[C++][Parquet] Add setting to limit the number of rows written per page #47030

@pitrou

Describe the enhancement requested

Both the Rust and Java implementations limit the number of rows written per page.

They apply this limit in addition to trying to keep the page size under 1 MB, so actual pages can end up much smaller than 1 MB.

However, Parquet C++ only has the 1 MB page size limit and does not limit the number of rows written per page. This can result in much larger pages than in other implementations.
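To illustrate the difference, here is a minimal standalone sketch of a page-flush policy that checks both a byte limit and a row limit. It is not the Parquet C++ writer code: `PageLimits`, `SplitIntoPages`, the fixed per-row encoded width, and the 20,000-row cap are all assumptions chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical limits; these names are illustrative, not Parquet C++ API.
struct PageLimits {
  int64_t max_bytes = 1024 * 1024;  // 1 MB page size target from the issue
  int64_t max_rows = 20000;         // assumed row cap, for illustration only
};

// Returns the row count of each page produced when writing `total_rows`
// rows that each take `bytes_per_row` bytes once encoded (a simplification:
// real encoded sizes vary per value and per encoding).
std::vector<int64_t> SplitIntoPages(int64_t total_rows, int64_t bytes_per_row,
                                    const PageLimits& limits) {
  std::vector<int64_t> pages;
  int64_t rows_in_page = 0;
  int64_t bytes_in_page = 0;
  for (int64_t i = 0; i < total_rows; ++i) {
    ++rows_in_page;
    bytes_in_page += bytes_per_row;
    // Flush as soon as either limit is reached.
    if (bytes_in_page >= limits.max_bytes || rows_in_page >= limits.max_rows) {
      pages.push_back(rows_in_page);
      rows_in_page = 0;
      bytes_in_page = 0;
    }
  }
  // Flush the final, partially filled page.
  if (rows_in_page > 0) pages.push_back(rows_in_page);
  return pages;
}
```

In this sketch, 1,000,000 rows that encode to 1 byte each never reach the 1 MB limit, so a size-only policy (setting `max_rows` to `INT64_MAX`) yields a single page holding every row, while the assumed 20,000-row cap yields 50 pages.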

Large pages can cause several problems:

  1. less CPU cache efficiency when reading, decompressing, etc.
  2. less fine-grained page pruning using predicate pushdown
  3. larger intermediate buffers, leading to a significant increase in memory consumption if there are many columns to read

Component(s)

C++
