Conversation

@wesm (Member) commented Jul 26, 2019

I sort of ran into a hill of fire ants with this. There is too much to do in a
single patch, so I'm going to list the main changes here along with the items
that will have to be dealt with expediently in follow-up patches:

  • Adds APIs to DictionaryBuilder for inserting dictionary memo values
    (InsertMemoValues) and indices (AppendIndices, taking int64_t*) separately
    (see the sketch after this list)
  • Adds GetBatchSpaced to RleDecoder
  • When "read_dictionary" is set for an Arrow column, dictionary indices are
    now appended directly to DictionaryBuilder rather than being fully decoded
    to dense values and then rehashed. This both saves memory and improves
    performance by skipping the hashing step
  • Dictionary fallback is no longer a problem because subsequent non-encoded
    pages are appended to the existing dictionary
  • Each time a new dictionary page is encountered, the DictionaryBuilder is
    "flushed", since the order of the dictionary changes. Note that this is
    only possible now due to ARROW-3144
  • Now that each row group can yield a different dictionary, row groups can
    end up with different dictionary index types. This is a bug. Rather than
    explode the size of this patch, I opened ARROW-6042 to add an exclusively
    int32-index dictionary builder so we can fix this as follow-up
  • The handling of dictionaries in parquet/arrow/reader.cc is a mess -- the
    structure of the column readers with respect to nested data is very messy,
    and as a result it is not possible to do direct dictionary decoding inside
    a nested subfield. We will have to fix this in the course of refactoring
    for improved nested data support
  • Refactors parquet/arrow/reader.cc for improved readability
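
To make the first item concrete, here is a minimal sketch of how a column reader might use the two new DictionaryBuilder entry points. Only InsertMemoValues and AppendIndices are named in this patch; the helper function, its parameters, and the choice of BinaryDictionaryBuilder are illustrative assumptions, not the patch's code.

```cpp
#include <cstdint>

#include "arrow/array.h"
#include "arrow/builder.h"
#include "arrow/status.h"

// Sketch only: append one decoded data page to an existing dictionary
// builder without a dense decode-then-rehash round trip.
arrow::Status AppendDecodedPage(arrow::BinaryDictionaryBuilder* builder,
                                const arrow::Array& dictionary_values,
                                const int64_t* indices, int64_t num_indices) {
  // Hash the dictionary page's values into the builder's memo table once...
  ARROW_RETURN_NOT_OK(builder->InsertMemoValues(dictionary_values));
  // ...then append the RLE-decoded indices directly, skipping the hashing
  // step described above.
  return builder->AppendIndices(indices, num_indices);
}
```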

There is much follow-up work to do here, so I will get busy with the other things while this is being reviewed.

@wesm (Member, Author) commented Jul 26, 2019

cc @hatemhelal @xhochy

@wesm (Member, Author) commented Jul 26, 2019

I haven't run benchmarks yet; I will post before/after numbers for a representative data set (with a lot of dictionary compression) when I can.

@xhochy (Member) commented Jul 26, 2019

Gave this a rough review and it looks good to me.

@wesm (Member, Author):

@pitrou I ran into brittleness with the way the template specialization was working in this file, so I refactored things to use std::enable_if more explicitly.
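
For illustration, a generic sketch of that pattern; the predicates and bodies here are placeholders, not the actual reader.cc code:

```cpp
#include <type_traits>

// Primary template; only the enable_if-constrained partial
// specializations below are ever instantiated.
template <typename T, typename Enable = void>
struct TransferFunctor;

// Selected only when T is an integral type.
template <typename T>
struct TransferFunctor<
    T, typename std::enable_if<std::is_integral<T>::value>::type> {
  void operator()() const { /* integral transfer path */ }
};

// Selected only when T is a floating-point type.
template <typename T>
struct TransferFunctor<
    T, typename std::enable_if<std::is_floating_point<T>::value>::type> {
  void operator()() const { /* floating-point transfer path */ }
};
```

Making the predicate explicit in each specialization avoids ambiguity, since every candidate is valid for a disjoint set of types.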

@wesm (Member, Author) commented Jul 26, 2019

This should be pretty close... I'm going to merge this once I get the build passing, so I am not stacking patches (of course, I will happily address further feedback in subsequent patches).

@wesm (Member, Author) commented Jul 26, 2019

Travis CI build: https://travis-ci.org/wesm/arrow/builds/564234236

wesm closed this in 38b0176 on Jul 26, 2019
wesm deleted the ARROW-3772 branch on July 26, 2019 at 23:56
@hatemhelal (Contributor) left a comment:

Sorry it took me a few days to get to reviewing this. Overall it is looking neater, but I had some minor questions and comments.

};

typedef std::function<void(int, std::shared_ptr<::DataType>*, std::shared_ptr<Array>*)>
typedef std::function<void(int, std::shared_ptr<DataType>*, std::shared_ptr<Array>*)>
@hatemhelal (Contributor):

Is this a copy from line 2015? I think this could be written as a using alias, which I find easier to read.
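
For reference, the suggested using spelling might look like the following; the alias name is a placeholder (the typedef's name is truncated in the diff above), and the same using declarations for DataType and Array as in the surrounding file are assumed:

```cpp
// Placeholder name; the alias reads left-to-right, unlike the typedef form.
using ChunkConverter = std::function<void(int, std::shared_ptr<DataType>*,
                                          std::shared_ptr<Array>*)>;
```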

TransferFunctor<ArrowType, ParquetType> func; \
RETURN_NOT_OK(func(record_reader_.get(), pool_, value_type, &result));

#define TRANSFER_CASE(ENUM, ArrowType, ParquetType) \
@hatemhelal (Contributor):

I think this macro can be removed.

return Status::OK();
}

std::shared_ptr<Field> ToDictionary32(const Field& field) {
@hatemhelal (Contributor):

I don't see where this is used?

@wesm (Member, Author):

I'm planning to do some refactoring related to the schema conversion and the instantiation of column readers, and will return to this then.
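
For context, a plausible reading of the helper based on its signature and the ARROW-6042 int32-index discussion above; the body here is inferred, not quoted from the patch:

```cpp
// Inferred sketch: wrap the field's value type in an int32-indexed
// dictionary type, leaving the field's name and nullability untouched.
std::shared_ptr<::arrow::Field> ToDictionary32(const ::arrow::Field& field) {
  auto dict_type = ::arrow::dictionary(::arrow::int32(), field.type());
  return ::arrow::field(field.name(), dict_type, field.nullable());
}
```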

/// \param column_indices indices of leaf nodes in the parquet schema tree. The order
/// of appearance matters for the converted schema. Repeated indices are ignored
/// except for the first one
/// \param properties reader options for FileReader
@hatemhelal (Contributor):

It looks like the properties parameter is unused; is there a reason to add it?

@wesm (Member, Author):

Re: above, the way that dictionary-encoded fields are handled in reader.cc right now is very hacky. I started adding logic to schema.cc to create DictionaryType instances and found that I was blocked by the current structure of reader.cc. I will return to this.


#include "arrow/util/logging.h"

#include "parquet/arrow/reader.h"
@hatemhelal (Contributor):

I think this relates to my comment on the additional parameter in FromParquetSchema. Having the writer depend on the reader appears, at first glance, to be an anti-pattern.

@wesm (Member, Author):

Will clean up the includes; this was about getting access to ArrowReaderProperties.

// will not work for arbitrary nested data
int current_column_idx = row_group_writer_->current_column();
std::shared_ptr<::arrow::Schema> arrow_schema;
RETURN_NOT_OK(FromParquetSchema(writer_->schema(), {current_column_idx - 1},
@hatemhelal (Contributor):

Could this be avoided by using the arrow::Schema property on the FileWriter? I read this as: we are recomputing the mapping from Parquet to Arrow when we already have the original schema that was passed in to create the writer instance. Eliminating this would resolve the unexpected coupling of the writer code to the reader.
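
A sketch of that suggestion, assuming the writer retained the ::arrow::Schema it was created with (the schema_ member is hypothetical) and a flat, non-nested schema where Parquet leaf i maps to Arrow field i:

```cpp
// Hypothetical: look up the current leaf's field from the cached Arrow
// schema instead of re-deriving it via FromParquetSchema.
int current_column_idx = row_group_writer_->current_column();
std::shared_ptr<::arrow::Field> field = schema_->field(current_column_idx - 1);
```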

kszucs pushed a commit that referenced this pull request Sep 6, 2019
With #5077 (or possibly #4949), behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>