Describe the enhancement requested
At the moment, it is not clear to me what the semantics of the ColumnChunk-level statistics are for nested columns.
It appears that they should be based on the leaf column (which makes sense to me), but null_count (and probably distinct_count) seems to be based partly on the nesting level.
import polars as pl
import io
import pyarrow.parquet as pq
df = pl.DataFrame([
    pl.Series('a', [[1, 2, 3], None], pl.Array(pl.Int32, 3)),
])
f = io.BytesIO()
pq.write_table(df.to_arrow(), f)
f.seek(0)
pq.read_metadata(f).row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ffe9bd626b0>
has_min_max: True
min: 1
max: 3
null_count: 1
distinct_count: None
num_values: 3
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
I would expect the null_count to equal 3 here if it were based on the leaf column.
Because of this, null_count basically has to be ignored for nested columns. Ideally, there would be a list of null_count values, one for every nullable level of the nesting; failing that, just specifying the semantics would be enough.
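To make the candidate counts concrete, here is a minimal pure-Python sketch of what a per-level null_count could look like for a nullable fixed-size list of nullable integers. This is not how pyarrow or the Parquet format computes anything; the function name `null_counts_per_level` and the dictionary keys are made up purely for illustration.

```python
from typing import Optional, Sequence

# One row of a nullable fixed-size list column with nullable integer elements.
Row = Optional[Sequence[Optional[int]]]


def null_counts_per_level(rows: Sequence[Row], width: int) -> dict[str, int]:
    """Count nulls separately for each nullable nesting level (hypothetical helper)."""
    # Rows where the list itself is null.
    null_lists = sum(1 for row in rows if row is None)
    # Null elements inside lists that are themselves present.
    null_leaves = sum(
        1 for row in rows if row is not None for value in row if value is None
    )
    return {
        "list_level": null_lists,
        "leaf_level": null_leaves,
        # Leaf nulls if every null list is expanded to `width` null elements.
        "leaf_level_expanded": null_leaves + width * null_lists,
    }


counts = null_counts_per_level([[1, 2, 3], None], width=3)
print(counts)  # {'list_level': 1, 'leaf_level': 0, 'leaf_level_expanded': 3}
```

For the column above, the reported null_count of 1 matches `list_level`, while the value of 3 I expected matches `leaf_level_expanded`.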