
Conversation

@wesm (Member) commented Aug 13, 2019

There are a lot of interconnected pieces going on in this patch, so let me try to unpack them:

  • Refactor TypedColumnWriterImpl::WriteBatch/WriteBatchSpaced to utilize more common code and be more readable
  • Add TypedEncoder<T>::Put(const arrow::Array&) and implement it for BYTE_ARRAY, to avoid having to first create a ByteArray* representation as before. This should improve write performance for regular binary data -- I will do some benchmarks to measure by how much
  • Add TypedStatistics<T>::Update(const arrow::Array&) and implement it for BYTE_ARRAY. This is necessary to be able to update the statistics given directly-inserted Arrow data without serialization
  • Implement PutDictionary and PutIndices methods on DictEncoder. PutDictionary is only implemented for BYTE_ARRAY but can be easily generalized to more types (we should open a follow-up JIRA for this)
  • Implement internal TypedColumnWriterImpl::WriteArrowDictionary that writes dictionary values and indices directly into a DictEncoder. This circumvents the dictionary page size checks so that we will continue to call PutIndices until a new dictionary is encountered or some non-dictionary data is written. Note that in master, dictionary encoding is turned off as soon as a threshold in dictionary size is reached, which is by default 1MB. So if you want to preserve exactly the original dictionary values (e.g. if you are roundtripping DictionaryArray, R factor, or pandas.Categorical), then we have to step around this threshold check in this narrow case.
  • Add ArrowWriterProperties::store_schema() option, which stores the Arrow schema used to create a Parquet file in a special ARROW:schema key in the file metadata so that we can detect that a column was originally a DictionaryArray (see the usage sketch after this list). This option is off by default, but it is enabled in the Python bindings. We can always make it the default in the future.
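To make the last bullet concrete, here is a minimal usage sketch. It is not code from this patch: it assumes the builder-style ArrowWriterProperties API with the new store_schema() option and the existing parquet::arrow::WriteTable entry point; header locations and exact signatures may differ between Arrow versions.

// A minimal sketch (not from this patch) of writing a table with the Arrow
// schema stored under the ARROW:schema metadata key.
#include "arrow/api.h"
#include "arrow/io/file.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

arrow::Status WriteWithStoredSchema(const std::shared_ptr<arrow::Table>& table,
                                    const std::string& path) {
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &sink));

  // Enable storing the Arrow schema so DictionaryArray columns round-trip.
  auto arrow_props = parquet::ArrowWriterProperties::Builder().store_schema()->build();

  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024,
                                    parquet::default_writer_properties(), arrow_props);
}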

I think that's most things. One end result of this is that arrow::DictionaryArray types from C++ and pandas.Categorical types with string dictionary values coming from pandas will be accurately preserved end-to-end. With a little more work (which I think can be done in a follow-up PR) we can support the rest of the Parquet physical types.

This was a fairly ugly project and I've doubtless left some messes around that we should clean up, but perhaps in follow-up patches.

I'll post some benchmarks later to assess improvements in read and write performance. In the case of writing dictionary-encoded strings the boost should be significant.

@wesm (Member Author) commented Aug 14, 2019

Looks like I have an ASAN failure and some compiler warnings on Windows. Will investigate tomorrow

@wesm (Member Author) commented Aug 14, 2019

OK, I think I've fixed the issues that popped up. I know there's a lot to review here, so we may have to leave many of the refinements to follow-up work.

@hatemhelal (Contributor) left a comment

Looks like this is coming together nicely.

Contributor

It would be worth adding a check for ARROW_IPC as a CMake option dependency of ARROW_PARQUET.

Contributor

This appears to fix the index type to arrow::int32. Is that a short-term limitation, or is there something more basic that requires it?

Member Author

We don't (can't, in general) know the cardinality of the dictionary up front. Further, the cardinality could be different from row group to row group or file to file, so if you allow the indices to be the smallest possible type, you may end up with a bunch of arrays with different index types. That was the rationale for adding a DictionaryBuilder variant that always returns int32 indices: 089e3db#diff-e15ddea2c1937474e62615fb906f6d97

There might be some mechanism we could explore in the future to allow the DictionaryArrays in a ChunkedArray to each have different index types.

Contributor

Thanks, that makes more sense to me now.

Contributor

Nice to see this go.

Contributor

I had to look at where this is called to understand what it does. Perhaps rename it to MaterializeDenseFromDictionary?

Member Author

Renaming to ConvertDictionaryToDense.

Contributor

I think it makes sense to move the call to CheckDictionarySizeLimit into the CommitWriteAndCheckPageLimit function.

Member Author

We can't have it there because the data page limit logic is needed in the direct dictionary write path, where we don't want to fall back to plain encoding.

Member Author

Added a comment.

Contributor

I think this should also check for Encoding::RLE_DICTIONARY as used in v2, or possibly reuse this logic from column_reader.cc:

// PLAIN_DICTIONARY is deprecated but used to be used as a dictionary index
// encoding.
static bool IsDictionaryIndexEncoding(const Encoding::type& e) {
  return e == Encoding::RLE_DICTIONARY || e == Encoding::PLAIN_DICTIONARY;
}

Member Author

When we create the encoder it's always PLAIN_DICTIONARY:

https://github.com/apache/arrow/blob/apache-arrow-0.14.1/cpp/src/parquet/encoding.cc#L355

I'll factor out this check into a function anyway for readability.
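For illustration, the factored-out helper could look something like the sketch below; the name is hypothetical (not necessarily what the patch uses) and it simply mirrors the reader-side logic quoted above.

// Hypothetical helper: the low-level encoder is created with the deprecated
// PLAIN_DICTIONARY value, but accept RLE_DICTIONARY as well for robustness.
static bool IsDictionaryEncoding(Encoding::type e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}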

Contributor

What are the conditions under which we might stop? I thought this would bypass the fallback logic.

Member Author

Technically you can feed this function a mix of DictionaryArray and non-DictionaryArray data and it will encode them happily. I can add a comment clarifying this.

Contributor

The earlier comment makes sense now.

Contributor

The body of this implementation appears to be a duplicate of PlainEncoder<ByteArrayType>::Put; could you refactor this into a shared implementation?

Member Author

I'll see what I can do

Member Author

done

@hatemhelal (Contributor) left a comment

Changes look good. Thanks for the careful explanation; looking forward to trying this out!

@wesm (Member Author) commented Aug 15, 2019

See before and after benchmarks:

https://gist.github.com/wesm/b9e917b46536676d2ed829bc4458ba3d

Summary, for this synthetic example:

  • Write performance of DictionaryArray improved 4x from 800ms to 200ms
  • Peak memory use when writing drops from ~700MB to ~50MB because a cast to dense is no longer required
  • Read performance (comparing direct dictionary read vs. read-as-dense) improved 5x from ~500ms to ~100ms
  • Writing of regular BinaryArray as BYTE_ARRAY improved by ~15-20% (see "dense_data" write at the bottom of each benchmark)

@wesm (Member Author) commented Aug 15, 2019

The AppVeyor failure in the R build seems legitimate; this is annoying:

C:/projects/arrow/cpp/src/parquet/arrow/reader_internal.cc: In function 'arrow::Status parquet::arrow::GetOriginSchema(const std::shared_ptr<const arrow::KeyValueMetadata>&, std::shared_ptr<const arrow::KeyValueMetadata>*, std::shared_ptr<arrow::Schema>*)':
C:/projects/arrow/cpp/src/parquet/arrow/reader_internal.cc:567:55: error: converting to 'const std::unordered_map<std::basic_string<char>, std::basic_string<char> >' from initializer list would use explicit constructor 'std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = std::basic_string<char>; _Tp = std::basic_string<char>; _Hash = std::hash<std::basic_string<char> >; _Pred = std::equal_to<std::basic_string<char> >; _Alloc = std::allocator<std::pair<const std::basic_string<char>, std::basic_string<char> > >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<std::basic_string<char> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<std::basic_string<char> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = std::allocator<std::pair<const std::basic_string<char>, std::basic_string<char> > >]'
     auto new_metadata = ::arrow::key_value_metadata({});
                                                       ^
make[2]: *** [src/parquet/CMakeFiles/parquet_static.dir/build.make:90: src/parquet/CMakeFiles/parquet_static.dir/arrow/reader_internal.cc.obj] Error 1
make[2]: *** Waiting for unfinished jobs....
C:/projects/arrow/cpp/src/parquet/arrow/writer.cc: In function 'arrow::Status parquet::arrow::GetSchemaMetadata(const arrow::Schema&, arrow::MemoryPool*, const parquet::ArrowWriterProperties&, std::shared_ptr<const arrow::KeyValueMetadata>*)':
C:/projects/arrow/cpp/src/parquet/arrow/writer.cc:569:44: error: converting to 'const std::unordered_map<std::basic_string<char>, std::basic_string<char> >' from initializer list would use explicit constructor 'std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = std::basic_string<char>; _Tp = std::basic_string<char>; _Hash = std::hash<std::basic_string<char> >; _Pred = std::equal_to<std::basic_string<char> >; _Alloc = std::allocator<std::pair<const std::basic_string<char>, std::basic_string<char> > >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<std::basic_string<char> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<std::basic_string<char> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = std::allocator<std::pair<const std::basic_string<char>, std::basic_string<char> > >]'
     result = ::arrow::key_value_metadata({});
                                            ^
make[2]: *** [src/parquet/CMakeFiles/parquet_static.dir/build.make:116: src/parquet/CMakeFiles/parquet_static.dir/arrow/writer.cc.obj] Error 1
make[1]: *** [CMakeFiles/Makefile2:1265: src/parquet/CMakeFiles/parquet_static.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
==> ERROR: A failure occurred in build().
    Aborting...

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26723020/job/w459ys9kd4838qlt

@bkietz (Member) commented Aug 15, 2019

https://github.com/apache/arrow/pull/5077/files#diff-66bf5c14b0dc83dceaebb3fe0ddbb55cR567
I'd recommend building new_metadata as an unordered_map and then calling key_value_metadata to wrap it after it's built (see the sketch after this comment).

You might have the same issue here:
https://github.com/apache/arrow/pull/5077/files#diff-806bd9c3d77823ae1bff914269e7db02R569
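A minimal sketch of that workaround, assuming the arrow::key_value_metadata() factory that takes an unordered_map; the function name and the use of the ARROW:schema key here are illustrative, not the patch's actual code. The idea is to populate the map first and then wrap it, so no initializer list is passed to the explicit std::unordered_map constructor.

#include <memory>
#include <string>
#include <unordered_map>

#include "arrow/util/key_value_metadata.h"

std::shared_ptr<const arrow::KeyValueMetadata> MakeSchemaMetadata(
    const std::string& serialized_schema) {
  std::unordered_map<std::string, std::string> entries;
  entries["ARROW:schema"] = serialized_schema;
  // Wrap the already-built map; avoids the {} conversion MinGW rejects above.
  return ::arrow::key_value_metadata(entries);
}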

Member

Could you write (or point to an existing) test which is identical but doesn't set store_schema?

Member Author

Yes, thanks for pointing this out; it definitely makes the test "stronger".

Member

Could you add a comment summarizing this function?

Member Author

It's a doozy. I'll add some more comments throughout

Member

inline is redundant for templates

Suggested change
inline std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(
std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(

Member Author

Noted. Is this also true for out-of-class implementations of template class methods?

Member

IIRC templates are always inline unless explicitly marked with extern

Member Author

Hm, there seems to be mixed information about this: https://stackoverflow.com/a/10536588/776560. I'm curious to know all the different cases, but I'll remove the inline here.
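For reference, a small standalone illustration of the rule being discussed (not code from the patch): a function template defined in a header does not need inline, while a full explicit specialization behaves like an ordinary function and does.

// Header-only example, unrelated to the patch.
template <typename T>
T Twice(T value) {  // function template: `inline` would be redundant here
  return value + value;
}

template <>
inline int Twice<int>(int value) {  // full specialization: an ordinary function,
  return 2 * value;                 // so `inline` is needed when defined in a header
}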

Member Author

Fixed

Member

Suggested change
struct TypeClasses {};
struct StatisticsTraits;

Member Author

I intended this to provide mappings to other classes, too. I'm not sure it's needed at all now since I removed some of the subclasses that made me add this in the first place. Will take a look

Member Author

I completely removed this type traits class; it was only needed in an intermediate state of this patch.

Member

Suggested change
if (dynamic_cast<const arrow::BinaryArray*>(&values) == nullptr) {
if (values->type_id() == Type::BINARY || values->type_id() == Type::STRING) {

Member Author

done

Member

Suggested change
// primitive type
// primitive type
if (array.dict_type()->index_type()->id() != Type::INT32) {
return Status::TypeError("Writing DictionaryArray with non int32 indices");
}

Member Author

Actually, any signed integer index type is supported.
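For illustration, the broader check implied here might look like the following sketch; the helper name is hypothetical and this is not the patch's actual code.

#include <memory>

#include "arrow/type.h"

// Accept any signed integer dictionary index type rather than only INT32.
bool IsSupportedDictionaryIndexType(const std::shared_ptr<arrow::DataType>& index_type) {
  switch (index_type->id()) {
    case arrow::Type::INT8:
    case arrow::Type::INT16:
    case arrow::Type::INT32:
    case arrow::Type::INT64:
      return true;
    default:
      return false;
  }
}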

Member

I don't see a test which checks this last path, could you add one?

Member Author

It's there:

$ gdb --args debug/parquet-arrow-test --gtest_filter=*ChangingDictionaries
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from debug/parquet-arrow-test...done.
(gdb) b src/parquet/column_writer.cc:1055
No source file named src/parquet/column_writer.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/parquet/column_writer.cc:1055) pending.
(gdb) r
Starting program: /home/wesm/code/arrow/cpp/build/debug/parquet-arrow-test --gtest_filter=\*ChangingDictionaries
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from /home/conda/feedstock_root/build_artifacts/gtest_1551008230529/work/googletest/src/gtest_main.cc
Note: Google Test filter = *ChangingDictionaries
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowWriteDictionaries
[ RUN      ] TestArrowWriteDictionaries.ChangingDictionaries

Breakpoint 1, parquet::TypedColumnWriterImpl<parquet::PhysicalType<(parquet::Type::type)6> >::WriteArrowDictionary (this=0x61400005b850, def_levels=0x7fffeeb9ac00, rep_levels=0x0, num_levels=50000, array=..., 
    ctx=0x612000048b98) at ../src/parquet/column_writer.cc:1055
1055	    PARQUET_CATCH_NOT_OK(FallbackToPlainEncoding());

@wesm (Member Author) commented Aug 16, 2019

Thank you for the code review!

@wesm (Member Author) commented Aug 16, 2019

Will merge this as soon as the build is passing...

@wesm closed this in 2ba0566 Aug 16, 2019
@wesm deleted the ARROW-3246 branch August 16, 2019 13:54
kszucs pushed a commit that referenced this pull request Sep 6, 2019
With #5077 (or possibly #4949), behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
nealrichardson pushed a commit that referenced this pull request Jan 8, 2020
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`.

The R binding is already ready for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation to implement as such in #5451, considering [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>