ARROW-3246: [C++][Python][Parquet] Direct writing of DictionaryArray to Parquet columns, automatic decoding to Arrow #5077
Conversation
Looks like I have an ASAN failure and some compiler warnings on Windows. Will investigate tomorrow.
OK, I think I've fixed the issues that popped up. I know there's a lot to review here, so we may have to leave many of the refinements to follow-up work.
hatemhelal left a comment:
Looks like this is coming together nicely.
It would be worth adding a CMake check that makes ARROW_IPC a required dependency of the ARROW_PARQUET option.
This appears to pin the indices to arrow::int32. Is that a short-term limitation, or is there something more fundamental that requires it?
We don't (and in general can't) know the cardinality of the dictionary up front. Further, the cardinality could differ from row group to row group or file to file, so if you allow the indices to be the smallest possible type, you may end up with a bunch of arrays with different index types. That was the rationale for adding a DictionaryBuilder variant that always returns int32 indices: 089e3db#diff-e15ddea2c1937474e62615fb906f6d97
There might be some mechanism we could explore in the future to allow the DictionaryArrays in a ChunkedArray to each have different index types.
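For illustration, a minimal sketch of that pattern, assuming the `Dictionary32Builder` variant referenced in the linked commit (the values here are made up): the builder always yields int32 indices regardless of dictionary cardinality, so chunks built at different times agree on the index type.

```cpp
#include <arrow/api.h>

#include <iostream>
#include <memory>

// Build a small dictionary array; no matter how few distinct values we
// append, the resulting indices are int32, so separately built chunks can
// share one index type. (Builder name per the Arrow C++ API; values are
// illustrative.)
arrow::Status BuildChunk() {
  arrow::Dictionary32Builder<arrow::StringType> builder;
  ARROW_RETURN_NOT_OK(builder.Append("low"));
  ARROW_RETURN_NOT_OK(builder.Append("high"));
  ARROW_RETURN_NOT_OK(builder.Append("low"));
  std::shared_ptr<arrow::Array> chunk;
  ARROW_RETURN_NOT_OK(builder.Finish(&chunk));
  // Prints: dictionary<values=string, indices=int32, ordered=0>
  std::cout << chunk->type()->ToString() << std::endl;
  return arrow::Status::OK();
}

int main() { return BuildChunk().ok() ? 0 : 1; }
```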
Thanks, that makes more sense to me now.
cpp/src/parquet/arrow/writer.cc (outdated)
Nice to see this go.
cpp/src/parquet/column_writer.cc (outdated)
I had to look at where this is called to understand what it did. Perhaps rename it to MaterializeDenseFromDictionary?
Renaming to ConvertDictionaryToDense.
cpp/src/parquet/column_writer.cc (outdated)
I think it makes sense to move the call to CheckDictionarySizeLimit into the CommitWriteAndCheckPageLimit function.
We can't have it there because the data page limit logic is needed in the direct dictionary write path, where we don't want to fall back to plain encoding.
Added a comment.
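To make the two paths concrete, here is a self-contained sketch of the separation described above; the function names echo the ones discussed in column_writer.cc, but the bodies are simplified stand-ins rather than the actual writer code.

```cpp
#include <cstdint>
#include <iostream>

constexpr int64_t kLimit = 1024 * 1024;  // default 1MB thresholds

// Stand-in for the page-size rollover check; this runs on both paths.
void CommitWriteAndCheckPageLimit(int64_t buffered_bytes) {
  if (buffered_bytes >= kLimit) std::cout << "flush data page\n";
}

// Stand-in for the dictionary fallback check; conventional path only.
void CheckDictionarySizeLimit(int64_t dict_bytes) {
  if (dict_bytes >= kLimit) std::cout << "fall back to plain encoding\n";
}

void WriteBatchConventional(int64_t buffered_bytes, int64_t dict_bytes) {
  CommitWriteAndCheckPageLimit(buffered_bytes);
  CheckDictionarySizeLimit(dict_bytes);
}

void WriteArrowDictionaryDirect(int64_t buffered_bytes) {
  // Only the data page limit applies here: falling back to plain encoding
  // would discard the caller's dictionary, which must be preserved exactly.
  CommitWriteAndCheckPageLimit(buffered_bytes);
}

int main() {
  WriteBatchConventional(2 * kLimit, 2 * kLimit);
  WriteArrowDictionaryDirect(2 * kLimit);
  return 0;
}
```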
cpp/src/parquet/column_writer.cc (outdated)
I think this should also check for Encoding::RLE_DICTIONARY (used in v2), or possibly reuse this logic from column_reader.cc:

```cpp
// PLAIN_DICTIONARY is deprecated but used to be used as a dictionary index
// encoding.
static bool IsDictionaryIndexEncoding(const Encoding::type& e) {
  return e == Encoding::RLE_DICTIONARY || e == Encoding::PLAIN_DICTIONARY;
}
```
When we create the encoder it's always PLAIN_DICTIONARY. I'll factor out this check into a function anyway, for readability.
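A hypothetical shape for that factored-out check (the name and placement are assumptions, mirroring the reader-side IsDictionaryIndexEncoding quoted above):

```cpp
#include "parquet/types.h"  // for parquet::Encoding

namespace parquet {

// The encoder is currently always created with PLAIN_DICTIONARY, but
// accepting RLE_DICTIONARY as well keeps this symmetric with the
// reader-side check.
static bool IsDictionaryEncoding(Encoding::type encoding) {
  return encoding == Encoding::PLAIN_DICTIONARY ||
         encoding == Encoding::RLE_DICTIONARY;
}

}  // namespace parquet
```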
cpp/src/parquet/column_writer.cc (outdated)
What are the conditions under which we might stop? I thought this would bypass the fallback logic.
Technically you can feed this function a mix of DictionaryArray and non-DictionaryArray chunks and it will encode them happily. I can add a comment clarifying that.
cpp/src/parquet/column_writer.cc (outdated)
The earlier comment makes sense now.
cpp/src/parquet/encoding.cc (outdated)
The body of this implementation appears to be a duplicate of PlainEncoder<ByteArrayType>::Put; could you refactor this into a shared impl?
I'll see what I can do
Done.
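For reference, one shape the shared implementation can take; the helper name and the callback-based design here are illustrative, not necessarily the exact code that landed in encoding.cc.

```cpp
#include <arrow/api.h>
#include <arrow/util/macros.h>

#include <iostream>
#include <string>

// Visit every non-null value of a BinaryArray. Both the plain and the
// dictionary encoders can share this traversal and differ only in the
// callback they pass for each value.
template <typename PutValue>
void VisitNonNullValues(const arrow::BinaryArray& values, PutValue put) {
  for (int64_t i = 0; i < values.length(); ++i) {
    if (!values.IsNull(i)) put(values.GetView(i));
  }
}

int main() {
  arrow::StringBuilder builder;
  ARROW_UNUSED(builder.Append("foo"));
  ARROW_UNUSED(builder.AppendNull());
  ARROW_UNUSED(builder.Append("bar"));
  std::shared_ptr<arrow::Array> array;
  ARROW_UNUSED(builder.Finish(&array));
  // Prints "foo" and "bar", skipping the null slot.
  VisitNonNullValues(static_cast<const arrow::BinaryArray&>(*array),
                     [](arrow::util::string_view view) {
                       std::cout << std::string(view) << "\n";
                     });
  return 0;
}
```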
hatemhelal left a comment:
Changes look good, thanks for the careful explanation and looking forward to trying this out!
See before and after benchmarks: https://gist.github.com/wesm/b9e917b46536676d2ed829bc4458ba3d. Summary, for this synthetic example:
The Appveyor failure in the R build seems legitimate, which is annoying: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26723020/job/w459ys9kd4838qlt
You might have the same issue here: https://github.com/apache/arrow/pull/5077/files#diff-66bf5c14b0dc83dceaebb3fe0ddbb55cR567
Could you write (or point to an existing) test which is identical but doesn't set store_schema?
Yes, thanks for pointing this out; it definitely makes the test "stronger".
Could you add a comment summarizing this function?
It's a doozy. I'll add some more comments throughout.
cpp/src/parquet/statistics.h (outdated)
inline is redundant for templates
Suggested change:

```diff
- inline std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(
+ std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(
```
Noted. Is this also true for out-of-class implementations of template class methods?
IIRC templates are always inline unless explicitly marked with extern.
Hm, there seems to be mixed information about this: https://stackoverflow.com/a/10536588/776560. I'm curious to know all the different cases, but I'll remove the inline here.
Fixed
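A minimal, self-contained illustration of the rule in question (not from the patch): a function template defined in a header can be included from many translation units without violating the one-definition rule, which is why the `inline` keyword adds nothing there.

```cpp
// twice.h -- safe to include from many .cc files even though the template's
// definition lives here; template definitions are exempt from the
// one-definition rule the way inline functions are, so marking Twice()
// `inline` would be redundant.
#pragma once

template <typename T>
T Twice(T value) {
  return value + value;
}

// By contrast, an ordinary (non-template) function defined in a header does
// need `inline` to avoid multiple-definition link errors:
inline int TwiceInt(int value) { return value + value; }
```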
cpp/src/parquet/types.h (outdated)
Suggested change:

```diff
- struct TypeClasses {};
+ struct StatisticsTraits;
```
I intended this to provide mappings to other classes, too. I'm not sure it's needed at all now, since I removed some of the subclasses that made me add this in the first place. Will take a look.
I completely removed this type traits class; it was only needed in an intermediate state of this patch.
cpp/src/parquet/encoding.cc (outdated)
Suggested change:

```diff
- if (dynamic_cast<const arrow::BinaryArray*>(&values) == nullptr) {
+ if (values->type_id() == Type::BINARY || values->type_id() == Type::STRING) {
```
Done.
cpp/src/parquet/arrow/writer.cc (outdated)
Suggested change:

```diff
  // primitive type
+ if (array.dict_type()->index_type()->id() != Type::INT32) {
+   return Status::TypeError("Writing DictionaryArray with non int32 indices");
+ }
```
Actually, any signed integer index type is supported.
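For example, all of these describe dictionary columns the writer accepts; the types are illustrative, built with the standard arrow::dictionary() factory:

```cpp
#include <arrow/api.h>

#include <iostream>

int main() {
  // Any signed integer index type works, not just int32.
  for (const auto& index_type :
       {arrow::int8(), arrow::int16(), arrow::int32(), arrow::int64()}) {
    std::cout << arrow::dictionary(index_type, arrow::utf8())->ToString()
              << std::endl;
  }
  return 0;
}
```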
cpp/src/parquet/column_writer.cc (outdated)
I don't see a test that checks this last path; could you add one?
It's there:
```console
$ gdb --args debug/parquet-arrow-test --gtest_filter=*ChangingDictionaries
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from debug/parquet-arrow-test...done.
(gdb) b src/parquet/column_writer.cc:1055
No source file named src/parquet/column_writer.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/parquet/column_writer.cc:1055) pending.
(gdb) r
Starting program: /home/wesm/code/arrow/cpp/build/debug/parquet-arrow-test --gtest_filter=\*ChangingDictionaries
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from /home/conda/feedstock_root/build_artifacts/gtest_1551008230529/work/googletest/src/gtest_main.cc
Note: Google Test filter = *ChangingDictionaries
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowWriteDictionaries
[ RUN      ] TestArrowWriteDictionaries.ChangingDictionaries
Breakpoint 1, parquet::TypedColumnWriterImpl<parquet::PhysicalType<(parquet::Type::type)6> >::WriteArrowDictionary (this=0x61400005b850, def_levels=0x7fffeeb9ac00, rep_levels=0x0, num_levels=50000, array=...,
    ctx=0x612000048b98) at ../src/parquet/column_writer.cc:1055
1055          PARQUET_CATCH_NOT_OK(FallbackToPlainEncoding());
```
Thank you for the code review!
Will merge this as soon as the build is passing...
With #5077 (or possibly #4949), behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`. The R binding is already prepared for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it this way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
There are a lot of interconnected pieces in this patch, so let me try to unpack them:
- Refactor `TypedColumnWriterImpl::WriteBatch`/`WriteBatchSpaced` to utilize more common code and be more readable
- Add `TypedEncoder<T>::Put(const arrow::Array&)` and implement it for BYTE_ARRAY, to avoid having to first create `ByteArray*` as before. This should improve write performance for regular binary data -- I will do some benchmarks to measure by how much
- Add `TypedStatistics<T>::Update(const arrow::Array&)` and implement it for BYTE_ARRAY. This is necessary to be able to update the statistics given directly-inserted Arrow data, without serialization
- Add `PutDictionary` and `PutIndices` methods on `DictEncoder`. `PutDictionary` is only implemented for BYTE_ARRAY but can be easily generalized to more types (we should open a follow-up JIRA for this)
- Add `TypedColumnWriterImpl::WriteArrowDictionary`, which writes dictionary values and indices directly into a `DictEncoder`. This circumvents the dictionary page size checks, so that we will continue to call `PutIndices` until a new dictionary is encountered or some non-dictionary data is written. Note that in master, dictionary encoding is turned off as soon as a threshold in dictionary size is reached, which is by default 1MB. So if you want to preserve exactly the original dictionary values (e.g. if you are roundtripping DictionaryArray, R factor, or pandas.Categorical), then we have to step around this threshold check in this narrow case.
- Add an `ArrowWriterProperties::store_schema()` option which stores the Arrow schema used to create a Parquet file in a special `ARROW:schema` key in the metadata, so that we can detect that a column was originally a DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future. (A usage sketch follows at the end of this description.)

I think that's most things. One end result of this is that `arrow::DictionaryArray` types from C++ and `pandas.Categorical` types coerced from pandas with string dictionary values will be accurately preserved end-to-end. With a little more work (which I think can be done in a follow-up PR) we can support the rest of the Parquet physical types.

This was a fairly ugly project and I've doubtlessly left some messes around that we should clean up, but perhaps in follow-up patches.
I'll post some benchmarks later to assess the improvements in read and write performance. In the case of writing dictionary-encoded strings, the boost should be significant.
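To make the `store_schema()` option concrete, here is a minimal C++ usage sketch (using the Status-returning `FileOutputStream::Open` of this era; the path and chunk size are arbitrary):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Write `table` with the Arrow schema embedded under the ARROW:schema
// metadata key, so that DictionaryArray columns round-trip as dictionaries.
arrow::Status WriteWithStoredSchema(const std::shared_ptr<arrow::Table>& table,
                                    const std::string& path) {
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &sink));
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024,
                                    parquet::default_writer_properties(),
                                    arrow_props);
}
```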