Conversation

@wesm (Member) commented Jul 26, 2019

I sort of ran into a hill of fire ants with this. There is too much to do in a
single patch, so I'm going to list the main changes here along with the items
that will have to be dealt with expediently in follow-up patches:

  • Adds APIs to DictionaryBuilder for inserting dictionary memo values
    (InsertMemoValues) and indices (AppendIndices, taking int64_t*) separately
    (see the sketch after this list)
  • Adds GetBatchSpaced to RleDecoder
  • When "read_dictionary" is set for an Arrow column, dictionary indices are
    now appended directly to DictionaryBuilder rather than being fully decoded
    to dense values and then rehashed. This both saves memory and improves
    performance by skipping the hashing step
  • Dictionary fallback is no longer a problem because subsequent non-encoded
    pages are appended to the existing dictionary
  • Each time a new dictionary page is encountered, the DictionaryBuilder is
    "flushed", since the order of the dictionary changes. Note that this is
    only possible now due to ARROW-3144
  • Now that each row group can yield a different dictionary, row groups can
    end up with different dictionary index types. This is a bug. Rather than
    explode the size of this patch, I opened ARROW-6042 to add an exclusively
    int32-index dictionary builder so we can fix this as follow-up
  • The handling of dictionaries in parquet/arrow/reader.cc is a mess -- the
    structure of the column readers with respect to nested data is very messy,
    and as a result it is not possible to do direct dictionary decoding inside
    a nested subfield. We will have to fix this in the course of refactoring
    for improved nested data support
  • Refactors parquet/arrow/reader.cc for improved readability
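
To make the first item concrete, here is a minimal sketch of how a column reader might use the two new DictionaryBuilder entry points. Only InsertMemoValues and AppendIndices are named in this patch; the helper function, its parameters, and the choice of BinaryDictionaryBuilder are illustrative assumptions, not the patch's code.

```cpp
#include <cstdint>

#include "arrow/array.h"
#include "arrow/builder.h"
#include "arrow/status.h"

// Sketch only: append one decoded data page to an existing dictionary
// builder without a dense decode-then-rehash round trip.
arrow::Status AppendDecodedPage(arrow::BinaryDictionaryBuilder* builder,
                                const arrow::Array& dictionary_values,
                                const int64_t* indices, int64_t num_indices) {
  // Hash the dictionary page's values into the builder's memo table once...
  ARROW_RETURN_NOT_OK(builder->InsertMemoValues(dictionary_values));
  // ...then append the RLE-decoded indices directly, skipping the hashing
  // step described above.
  return builder->AppendIndices(indices, num_indices);
}
```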

There is much follow-up work to do here, so I will get busy with the other things while this is being reviewed.

@wesm (Member, Author) commented Jul 26, 2019

cc @hatemhelal @xhochy

@wesm (Member, Author) commented Jul 26, 2019

I haven't run benchmarks yet; I will post before/after numbers for a representative data set (with a lot of dictionary compression) when I can.

@xhochy (Member) commented Jul 26, 2019

Gave this a rough review and it looks good to me.

@wesm (Member, Author):

@pitrou I ran into brittleness with the way the template specialization was working in this file, so I refactored things to use std::enable_if more explicitly.
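
For illustration, a generic sketch of that pattern; the predicates and bodies here are placeholders, not the actual reader.cc code:

```cpp
#include <type_traits>

// Primary template; only the enable_if-constrained partial
// specializations below are ever instantiated.
template <typename T, typename Enable = void>
struct TransferFunctor;

// Selected only when T is an integral type.
template <typename T>
struct TransferFunctor<
    T, typename std::enable_if<std::is_integral<T>::value>::type> {
  void operator()() const { /* integral transfer path */ }
};

// Selected only when T is a floating-point type.
template <typename T>
struct TransferFunctor<
    T, typename std::enable_if<std::is_floating_point<T>::value>::type> {
  void operator()() const { /* floating-point transfer path */ }
};
```

Making the predicate explicit in each specialization avoids ambiguity, since every candidate is valid for a disjoint set of types.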

@wesm (Member, Author) commented Jul 26, 2019

This should be pretty close... I'm going to merge this once I get the build passing, so I am not stacking patches (of course, I will happily address further feedback in subsequent patches).

@wesm (Member, Author) commented Jul 26, 2019

Travis CI build: https://travis-ci.org/wesm/arrow/builds/564234236

wesm closed this in 38b0176 on Jul 26, 2019
wesm deleted the ARROW-3772 branch on July 26, 2019 at 23:56
@hatemhelal (Contributor) left a comment:

Sorry it took me a few days to get to reviewing this. Overall it is looking neater, but I had some minor questions and comments.

};

typedef std::function<void(int, std::shared_ptr<::DataType>*, std::shared_ptr<Array>*)>
typedef std::function<void(int, std::shared_ptr<DataType>*, std::shared_ptr<Array>*)>
@hatemhelal (Contributor):

Is this a copy from line 2015? I think this could be written as a using alias, which I find easier to read.
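
For reference, the suggested using spelling might look like the following; the alias name is a placeholder (the typedef's name is truncated in the diff above), and the same using declarations for DataType and Array as in the surrounding file are assumed:

```cpp
// Placeholder name; the alias reads left-to-right, unlike the typedef form.
using ChunkConverter = std::function<void(int, std::shared_ptr<DataType>*,
                                          std::shared_ptr<Array>*)>;
```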

TransferFunctor<ArrowType, ParquetType> func; \
RETURN_NOT_OK(func(record_reader_.get(), pool_, value_type, &result));

#define TRANSFER_CASE(ENUM, ArrowType, ParquetType) \
@hatemhelal (Contributor):

I think this macro can be removed.

return Status::OK();
}

std::shared_ptr<Field> ToDictionary32(const Field& field) {
@hatemhelal (Contributor):

I don't see where this is used?

@wesm (Member, Author):

I'm planning to do some refactoring related to the schema conversion and the instantiation of column readers, and will return to this then.
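
For context, a plausible reading of the helper based on its signature and the ARROW-6042 int32-index discussion above; the body here is inferred, not quoted from the patch:

```cpp
// Inferred sketch: wrap the field's value type in an int32-indexed
// dictionary type, leaving the field's name and nullability untouched.
std::shared_ptr<::arrow::Field> ToDictionary32(const ::arrow::Field& field) {
  auto dict_type = ::arrow::dictionary(::arrow::int32(), field.type());
  return ::arrow::field(field.name(), dict_type, field.nullable());
}
```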

/// \param column_indices indices of leaf nodes in the parquet schema tree. The order
/// of appearance matters for the converted schema. Repeated indices are ignored
/// except for the first one
/// \param properties reader options for FileReader
@hatemhelal (Contributor):

It looks like the properties parameter is unused; is there a reason to add it?

@wesm (Member, Author):

Re: above, the way that dictionary-encoded fields are handled in reader.cc right now is very hacky. I started adding logic to schema.cc to create DictionaryType instances and found that I was blocked by the current structure of reader.cc. I will return to this.


#include "arrow/util/logging.h"

#include "parquet/arrow/reader.h"
@hatemhelal (Contributor):

I think this relates to my comment on the additional parameter in FromParquetSchema. Having the writer depend on the reader appears, at first glance, to be an anti-pattern.

@wesm (Member, Author):

Will clean up the includes; this was about getting access to ArrowReaderProperties.

// will not work for arbitrary nested data
int current_column_idx = row_group_writer_->current_column();
std::shared_ptr<::arrow::Schema> arrow_schema;
RETURN_NOT_OK(FromParquetSchema(writer_->schema(), {current_column_idx - 1},
@hatemhelal (Contributor):

Could this be avoided by using the arrow::Schema property on the FileWriter? I read this as: we are recomputing the mapping from Parquet to Arrow when we already have the original schema that was passed in to create the writer instance. Eliminating this would resolve the unexpected coupling of the writer code to the reader.
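
A sketch of that suggestion, assuming the writer retained the ::arrow::Schema it was created with (the schema_ member is hypothetical) and a flat, non-nested schema where Parquet leaf i maps to Arrow field i:

```cpp
// Hypothetical: look up the current leaf's field from the cached Arrow
// schema instead of re-deriving it via FromParquetSchema.
int current_column_idx = row_group_writer_->current_column();
std::shared_ptr<::arrow::Field> field = schema_->field(current_column_idx - 1);
```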

kszucs pushed a commit that referenced this pull request Sep 6, 2019
With #5077 (or possibly #4949), behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>