Document the behavior of Nested ColumnChunk statistics  #476

@coastalwhite

Describe the enhancement requested

At the moment, the semantics of the ColumnChunk-level statistics for nested columns are not clear to me.

The statistics appear to be based on the leaf column (which makes sense to me), but null_count (and probably distinct_count) seems to be based partly on the nesting level.

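import io

import polars as pl
import pyarrow.parquet as pq

# One nullable column of fixed-size arrays (width 3); the second row is null.
df = pl.DataFrame([
    pl.Series('a', [[1, 2, 3], None], pl.Array(pl.Int32, 3)),
])

# Write the frame to an in-memory Parquet file via PyArrow.
f = io.BytesIO()
pq.write_table(df.to_arrow(), f)

# Read back the ColumnChunk statistics of the only column in row group 0.
f.seek(0)
pq.read_metadata(f).row_group(0).column(0).statistics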
<pyarrow._parquet.Statistics object at 0x7ffe9bd626b0>
  has_min_max: True
  min: 1
  max: 3
  null_count: 1
  distinct_count: None
  num_values: 3
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE

If null_count were based on the leaf column, I would expect it to equal 3 here, since the null row covers three leaf slots in an array of width 3.

Because of this, null_count effectively has to be ignored for nested columns. Ideally, there would be one null_count per nullable nesting level; failing that, simply specifying the semantics would be enough.
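To make the "one count per nesting level" idea concrete, here is a minimal pure-Python sketch. It assumes a hypothetical three-level list schema and Dremel-style definition levels; the helper names `definition_levels` and `per_level_counts` are made up for illustration and are not part of any Parquet library. It only shows one reading that happens to match the numbers above (one entry below the maximum level, three at it); it is not a claim about what the specification or any writer actually does.

```python
from collections import Counter

# Assumed schema, for illustration only:
#   optional group a (LIST) { repeated group list { optional int32 element; } }
# Definition levels under this assumption:
#   0 = list is null, 1 = list is empty, 2 = element is null, 3 = element present
MAX_DEF_LEVEL = 3


def definition_levels(rows):
    """Return one definition level per leaf entry, Dremel-style."""
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)  # a null list still emits a single entry
        elif len(row) == 0:
            levels.append(1)  # so does an empty list
        else:
            for item in row:
                levels.append(2 if item is None else 3)
    return levels


def per_level_counts(rows):
    """Histogram of definition levels: entries that stop at each level."""
    hist = Counter(definition_levels(rows))
    return [hist.get(level, 0) for level in range(MAX_DEF_LEVEL + 1)]


rows = [[1, 2, 3], None]
counts = per_level_counts(rows)
print(counts)  # [1, 0, 0, 3]
```

Under these assumptions, `counts[0]` is the number of null lists, `counts[2]` the number of null elements, and `counts[3]` the number of present leaf values, so a reader could tell the nesting levels apart instead of receiving a single merged null_count.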
