Document the behavior of Nested ColumnChunk statistics #476

### Describe the enhancement requested

At the moment, the semantics of the `ColumnChunk`-level statistics for nested columns are not clear to me.

The statistics appear to be based on the leaf column (which makes sense to me), but `null_count` (and probably `distinct_count`) seems to be based partly on the nesting level.

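```python
import io

import polars as pl
import pyarrow.parquet as pq

# One nullable fixed-size array column: a valid row and a null row.
df = pl.DataFrame([
    pl.Series('a', [[1, 2, 3], None], pl.Array(pl.Int32, 3)),
])

# Write the table to an in-memory Parquet file.
f = io.BytesIO()
pq.write_table(df.to_arrow(), f)

# Read back the statistics of the only column chunk.
f.seek(0)
print(pq.read_metadata(f).row_group(0).column(0).statistics)
```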

```console
<pyarrow._parquet.Statistics object at 0x7ffe9bd626b0>
  has_min_max: True
  min: 1
  max: 3
  null_count: 1
  distinct_count: None
  num_values: 3
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE
```

I would expect the `null_count` to equal `3` here if it were based on the leaf column.
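To make the two readings concrete, here is a minimal sketch in plain Python. It models the column as a list of rows and counts nulls in two ways. The function names are hypothetical, and this is only an illustration of the two candidate semantics, not a description of how any Parquet writer actually computes the statistic.

```python
WIDTH = 3  # fixed array width, as in pl.Array(pl.Int32, 3)
rows = [[1, 2, 3], None]  # same data as the example above


def leaf_slot_null_count(rows, width):
    """Hypothetical leaf-based reading: a null array hides `width` leaf slots."""
    count = 0
    for row in rows:
        if row is None:
            count += width
        else:
            count += sum(v is None for v in row)
    return count


def entry_null_count(rows):
    """Hypothetical nesting-aware reading: a null array counts as one null."""
    count = 0
    for row in rows:
        if row is None:
            count += 1
        else:
            count += sum(v is None for v in row)
    return count


print(leaf_slot_null_count(rows, WIDTH))  # 3, the value I would expect
print(entry_null_count(rows))             # 1, the value reported above
```

Under this toy model, the reported `null_count: 1` matches the second reading, which mixes the array level into what is otherwise a leaf statistic.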

Because of this, `null_count` basically has to be ignored for nested columns. Ideally, there would be a list of `null_count` values, one for every nullable level of the nesting; failing that, just specifying the semantics would already help.
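As a rough sketch of what a per-level list could look like, the following plain-Python snippet counts nulls separately at the array level and at the element level for a column of nullable arrays with nullable elements. The function name and the two-element output layout are assumptions made for illustration; nothing like this exists in the format today.

```python
def null_counts_per_level(rows):
    """Return [nulls at the array level, nulls at the element level].

    Hypothetical layout: one entry per nullable nesting level, outermost first.
    """
    array_nulls = sum(row is None for row in rows)
    element_nulls = sum(
        value is None
        for row in rows
        if row is not None
        for value in row
    )
    return [array_nulls, element_nulls]


rows = [[1, 2, 3], None, [4, None, 6]]
print(null_counts_per_level(rows))  # [1, 1]
```

With counts split like this, a reader could tell a null array apart from a null element, which a single `null_count` cannot express.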
