
[C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray #20110

@asfimport

Description


Dictionary-encoded data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:

  1. Smaller memory footprint, both during decoding and in the resulting Arrow table, especially when the dictionary values are large.

  2. Better decoding performance, mostly as a result of the first point: fewer memory fetches and fewer allocations.

I think these benefits could significantly improve runtime.

My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays and create a DictionaryArray from them.
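
To make the idea concrete, here is a minimal sketch of the two-array layout, using only the C++ standard library. The names `DictColumn`, `Materialize`, `DictPayloadBytes`, and `PlainPayloadBytes` are hypothetical and exist only for illustration; they are not part of parquet-cpp or Arrow, and the byte counts ignore container overhead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column.
// This is NOT Arrow's DictionaryArray, only an illustration of the layout.
struct DictColumn {
  std::vector<std::string> dictionary;  // each distinct value stored once
  std::vector<int32_t> indices;         // one entry per row, indexing `dictionary`
};

// Roughly what the current read path does: expand every index into a full
// copy of its value, producing a plain (non-dictionary) column.
std::vector<std::string> Materialize(const DictColumn& col) {
  std::vector<std::string> out;
  out.reserve(col.indices.size());
  for (int32_t idx : col.indices) {
    // at() throws std::out_of_range if an index does not fit the dictionary.
    out.push_back(col.dictionary.at(static_cast<std::size_t>(idx)));
  }
  return out;
}

// Approximate payload size of the dictionary form:
// the index bytes plus the characters of each distinct value.
std::size_t DictPayloadBytes(const DictColumn& col) {
  std::size_t bytes = col.indices.size() * sizeof(int32_t);
  for (const auto& value : col.dictionary) bytes += value.size();
  return bytes;
}

// Approximate payload size of the plain form:
// the characters of every row's value, repeated values included.
std::size_t PlainPayloadBytes(const std::vector<std::string>& values) {
  std::size_t bytes = 0;
  for (const auto& value : values) bytes += value.size();
  return bytes;
}
```

Keeping the column as a `DictColumn` skips the `Materialize` step entirely. Under these assumptions the saving grows with the length of the dictionary values and with how often each value repeats, since each row costs a fixed-size index instead of a full copy of its value.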

There are some questions to discuss:

  1. Should this be the default behavior for dictionary-encoded data?

  2. Should it be controlled with a parameter in the API?

  3. What should the policy be when some of the chunks are dictionary encoded and some are not?

I have started implementing this but would like to hear your opinions.

Reporter: Stav Nir
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3772. Please see the migration documentation for further details.
