[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

Dictionary data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:
1. Smaller memory footprint - both in the decoding process and in the resulting Arrow table - especially when the dictionary values are large.
1. Better decoding performance - mostly as a result of the first point - fewer memory fetches and fewer allocations.
   
   I think these benefits could yield significant runtime improvements.
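
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The helper names `DenseBytes` and `DictBytes` are hypothetical and not part of Arrow or parquet-cpp. The model assumes 32-bit offsets and indices, and ignores validity bitmaps and the one extra trailing offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Arrow API): approximate bytes for a dense
// variable-length column, with one 32-bit offset plus the value bytes per row.
inline std::size_t DenseBytes(std::size_t num_rows, std::size_t avg_value_len) {
  return num_rows * (sizeof(std::int32_t) + avg_value_len);
}

// Hypothetical helper (not an Arrow API): approximate bytes for a
// dictionary-encoded column, with one 32-bit index per row and each
// distinct value stored once (offset plus value bytes).
inline std::size_t DictBytes(std::size_t num_rows, std::size_t num_distinct,
                             std::size_t avg_value_len) {
  assert(num_distinct <= num_rows);
  return num_rows * sizeof(std::int32_t) +
         num_distinct * (sizeof(std::int32_t) + avg_value_len);
}
```

Under these assumptions, one million rows drawn from 100 distinct 50-byte values take about 54 MB when stored densely and about 4 MB when dictionary-encoded.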
   
   My planned approach is to read the indices (through the DictionaryDecoder, after RLE decoding) and the values separately into two arrays, and then create a DictionaryArray from them.
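
The idea can be sketched with plain standard-library containers. This is only an illustration under assumptions: `DictColumn`, `MakeDictColumn`, and `DecodeDense` are hypothetical names, not parquet-cpp or Arrow APIs. In Arrow C++ the analogous step would presumably be building a `DictionaryArray` from an indices array and a dictionary array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary array: per-row indices plus the
// distinct values they refer to.
struct DictColumn {
  std::vector<std::int32_t> indices;
  std::vector<std::string> dictionary;
};

// Combine separately read indices and values without expanding any row.
inline DictColumn MakeDictColumn(std::vector<std::int32_t> indices,
                                 std::vector<std::string> dictionary) {
  for (std::int32_t idx : indices) {
    // Every index must point at an existing dictionary entry.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < dictionary.size());
    (void)idx;  // silence unused-variable warnings when NDEBUG is set
  }
  return DictColumn{std::move(indices), std::move(dictionary)};
}

// Roughly what happens today: every index is expanded into a full copy of
// its value, so repeated values are duplicated in memory.
inline std::vector<std::string> DecodeDense(const DictColumn& col) {
  std::vector<std::string> out;
  out.reserve(col.indices.size());
  for (std::int32_t idx : col.indices) {
    out.push_back(col.dictionary[static_cast<std::size_t>(idx)]);
  }
  return out;
}
```

`MakeDictColumn` only moves the two buffers, so no per-row value copy takes place. `DecodeDense` is included only for contrast with the dense path.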
   
   There are some questions to discuss:
1. Should this be the default behavior for dictionary-encoded data?
1. Should it be controlled by a parameter in the API?
1. What should the policy be when some of the chunks are dictionary-encoded and some are not?
   
   I started implementing this but would like to hear your opinions.

**Reporter**: [Stav Nir](https://issues.apache.org/jira/browse/ARROW-3772)
**Assignee**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-3772) / @wesm
#### Related issues:
- [[C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY](https://github.com/apache/arrow/issues/22534) (relates to)
- [[Python] CategoricalIndex is lost after reading back](https://github.com/apache/arrow/issues/19959) (is related to)
- [[Python] Reading a dictionary column from Parquet results in disproportionate memory usage](https://github.com/apache/arrow/issues/22400) (is related to)
- [[C++] Provide method on AdaptiveIntBuilder for appending integer Array types](https://github.com/apache/arrow/issues/22392) (is related to)
- [[Python] Support reading Parquet binary/string columns directly as DictionaryArray](https://github.com/apache/arrow/issues/19660) (is related to)
- [[C++] Support using Array::View from compatible dictionary type to another](https://github.com/apache/arrow/issues/22451) (is related to)
- [[Python][Parquet] direct reading/writing of pandas categoricals in parquet](https://github.com/apache/arrow/issues/19588) (is related to)
- [[Python] Support reading Parquet binary/string columns directly as DictionaryArray](https://github.com/apache/arrow/issues/19660) (is depended upon by)
#### PRs and other links:
- [GitHub Pull Request #4949](https://github.com/apache/arrow/pull/4949)

<sub>**Note**: *This issue was originally created as [ARROW-3772](https://issues.apache.org/jira/browse/ARROW-3772). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>


[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray #20110
