ARROW-3772: [C++][Parquet] Write Parquet dictionary indices directly to DictionaryBuilder rather than routing through dense form #4949
Conversation
I haven't run benchmarks yet; will post before/after numbers for a representative data set (with a lot of dictionary compression) when I can.

Gave this a rough review and it looks good to me.
cpp/src/arrow/array/builder_dict.cc
@pitrou I ran into brittleness with the way the template specialization was working in this file. I refactored things to use std::enable_if more explicitly.
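To illustrate the kind of refactoring described here, a minimal stdlib-only sketch (not the actual builder_dict.cc code; the function name `AppendKind` is invented): instead of relying on partial template specialization, each overload states its own applicability condition through `std::enable_if`, which makes the dispatch explicit and less brittle.

```cpp
#include <string>
#include <type_traits>

// Hypothetical example: select an overload per value-type category via
// SFINAE rather than template specialization.
template <typename T,
          typename std::enable_if<std::is_integral<T>::value, int>::type = 0>
std::string AppendKind(const T&) {
  return "integral";
}

template <typename T,
          typename std::enable_if<std::is_floating_point<T>::value, int>::type = 0>
std::string AppendKind(const T&) {
  return "floating";
}
```

When the condition is spelled out at the overload itself, adding a new category does not risk silently changing which specialization wins.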
This should be pretty close... I'm going to merge this once I get the build passing so I am not stacking patches (of course, I will happily address further feedback in subsequent patches).

Squashed commits:

- Add methods to DictionaryBuilder to insert dictionary memo values and indices independently
- Prototype writing dictionary values / indices directly [skip ci]

Travis CI build: https://travis-ci.org/wesm/arrow/builds/564234236
hatemhelal left a comment:
Sorry it took me a few days to get to reviewing this. Overall it's looking neater, but I had some minor questions / comments.
```diff
 };
-typedef std::function<void(int, std::shared_ptr<::DataType>*, std::shared_ptr<Array>*)>
+typedef std::function<void(int, std::shared_ptr<DataType>*, std::shared_ptr<Array>*)>
```
Is this a copy from line 2015? I think this could be written as a using alias which I find easier to read.
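To make the suggestion concrete, here is a stdlib-only sketch of the two spellings (`DataType` and `Array` are stand-in structs here, not the real Arrow classes, and `ArrayFactory` is an invented alias name):

```cpp
#include <functional>
#include <memory>
#include <type_traits>

// Stand-ins for the Arrow types referenced in the diff.
struct DataType {};
struct Array {};

// typedef form, as in the patch:
typedef std::function<void(int, std::shared_ptr<DataType>*, std::shared_ptr<Array>*)>
    ArrayFactoryTypedef;

// Equivalent using-alias form, which reads left-to-right
// (name first, then the type it denotes):
using ArrayFactory =
    std::function<void(int, std::shared_ptr<DataType>*, std::shared_ptr<Array>*)>;
```

Both declarations denote exactly the same type; the `using` form is generally preferred in modern C++ because the alias name comes first.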
```cpp
  TransferFunctor<ArrowType, ParquetType> func; \
  RETURN_NOT_OK(func(record_reader_.get(), pool_, value_type, &result));

#define TRANSFER_CASE(ENUM, ArrowType, ParquetType) \
```
I think this macro can be removed.
```cpp
  return Status::OK();
}

std::shared_ptr<Field> ToDictionary32(const Field& field) {
```
I don't see where this is used?
I'm planning to do some refactoring related to the schema conversion and instantiation of column readers; I will return to this then.
```cpp
/// \param column_indices indices of leaf nodes in parquet schema tree. Appearing ordering
///     matters for the converted schema. Repeated indices are ignored
///     except for the first one
/// \param properties reader options for FileReader
```
It looks like the properties parameter is unused - is there a reason to prefer adding it?
Re: above, the way that dictionary-encoded fields are handled in reader.cc right now is very hacky. I started adding logic to schema.cc to create DictionaryType instances and found that I was blocked by the current structure of reader.cc. I will return to this.
```cpp
#include "arrow/util/logging.h"

#include "parquet/arrow/reader.h"
```
I think this relates to my comment on the additional parameter in FromParquetSchema. Having the writer depend on the reader at first glance appears to be an anti-pattern.
Will clean up includes; this was about getting access to ArrowReaderProperties.
```cpp
// will not work for arbitrary nested data
int current_column_idx = row_group_writer_->current_column();
std::shared_ptr<::arrow::Schema> arrow_schema;
RETURN_NOT_OK(FromParquetSchema(writer_->schema(), {current_column_idx - 1},
```
Could this be avoided by using the arrow::Schema property on the FileWriter? I read this as: we are recomputing the mapping from Parquet to Arrow when we already have the original schema that was passed in to create the writer instance. Eliminating this would resolve the unexpected coupling of the writer code on the reader.
With #5077 (or possibly #4949) behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
I sort of ran into a hill of fire ants with this. There is too much to do in a single patch, so I'm going to list out some of the things here and the follow-up items that will have to be dealt with expediently in subsequent patches:
- Add methods to DictionaryBuilder to insert dictionary memo values (InsertMemoValues) and indices (AppendIndices, as int64_t*) separately
- Add GetBatchSpaced to RleDecoder
- Dictionary indices are appended directly to DictionaryBuilder rather than being fully decoded to dense form, then rehashed. So this will both save memory and improve performance by skipping hashing
- When the dictionary changes mid-file, the new values will be appended to the existing dictionary
- When a column falls back to non-dictionary encoding, accumulated data must be "flushed" since the order of the dictionary is changed. Note that this is only possible now due to ARROW-3144
- The chunks of a column read this way can currently have different dictionary index types. This is a bug. Rather than explode the size of this patch I opened ARROW-6042 to add an exclusively int32-index dictionary builder so we can fix this as follow up
- The internal structure of the column readers with respect to nested data is very messy, and as a result it's not possible to do direct dictionary decoding inside a nested subfield. We will have to fix this in the course of refactoring for improved nested data support
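The core idea in the list above can be sketched with a toy stdlib-only model (this is not the Arrow API; `ToyDictBuilder` and its members are invented for illustration): the direct path installs the dictionary page once and then appends raw indices, while the old dense path materializes every value and pays a hash lookup per element.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of the two decode paths for a dictionary-encoded column.
struct ToyDictBuilder {
  std::unordered_map<std::string, int32_t> memo;
  std::vector<std::string> values;
  std::vector<int32_t> indices;

  // Direct path: install the Parquet dictionary page once...
  void InsertMemoValues(const std::vector<std::string>& dict) {
    for (const auto& v : dict) {
      memo.emplace(v, static_cast<int32_t>(values.size()));
      values.push_back(v);
    }
  }
  // ...then append the RLE-decoded indices without touching the values
  // (no per-element hashing).
  void AppendIndices(const std::vector<int32_t>& idx) {
    indices.insert(indices.end(), idx.begin(), idx.end());
  }

  // Dense path (the old behavior): materialize each value, then rehash it.
  void AppendDense(const std::vector<std::string>& dict,
                   const std::vector<int32_t>& idx) {
    for (int32_t i : idx) {
      const std::string& v = dict[i];
      auto it = memo.find(v);  // hash lookup per element
      if (it == memo.end()) {
        it = memo.emplace(v, static_cast<int32_t>(values.size())).first;
        values.push_back(v);
      }
      indices.push_back(it->second);
    }
  }
};
```

Both paths produce the same logical column, but the direct path does O(dictionary size) hashing instead of O(number of values), which is where the memory and CPU savings described above come from.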
There is much follow-up work to do here, so I will get busy with the other things while this is being reviewed.