ARROW-3246: [C++][Python][Parquet] Direct writing of DictionaryArray to Parquet columns, automatic decoding to Arrow #5077
Conversation
Looks like I have an ASAN failure and some compiler warnings on Windows. Will investigate tomorrow.
OK, I think I've fixed the issues that popped up. I know there's a lot to review here, so we may have to leave many of the refinements to follow-up work.
hatemhelal left a comment:
Looks like this is coming together nicely.
It would be worth adding a CMake check that makes ARROW_IPC a required dependency of the ARROW_PARQUET option.
This appears to pin the indices to arrow::int32. Is that a short-term limitation, or is there something more fundamental that requires it?
We don't (and in general can't) know the cardinality of the dictionary up front. Further, the cardinality could differ from row group to row group or file to file, so if you allow the indices to be the smallest possible type, you may end up with a bunch of arrays with different index types. That was the rationale for adding a DictionaryBuilder variant that always returns int32 indices: 089e3db#diff-e15ddea2c1937474e62615fb906f6d97
There might be some mechanism we could explore in the future to allow the DictionaryArrays in a ChunkedArray to each have different index types.
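For illustration, a minimal sketch of that pattern, assuming the `Dictionary32Builder` variant referenced in the linked commit (the values here are made up): the builder always yields int32 indices regardless of dictionary cardinality, so chunks built at different times agree on the index type.

```cpp
#include <arrow/api.h>

#include <iostream>
#include <memory>

// Build a small dictionary array; no matter how few distinct values we
// append, the resulting indices are int32, so separately built chunks can
// share one index type. (Builder name per the Arrow C++ API; values are
// illustrative.)
arrow::Status BuildChunk() {
  arrow::Dictionary32Builder<arrow::StringType> builder;
  ARROW_RETURN_NOT_OK(builder.Append("low"));
  ARROW_RETURN_NOT_OK(builder.Append("high"));
  ARROW_RETURN_NOT_OK(builder.Append("low"));
  std::shared_ptr<arrow::Array> chunk;
  ARROW_RETURN_NOT_OK(builder.Finish(&chunk));
  // Prints: dictionary<values=string, indices=int32, ordered=0>
  std::cout << chunk->type()->ToString() << std::endl;
  return arrow::Status::OK();
}

int main() { return BuildChunk().ok() ? 0 : 1; }
```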
Thanks, that makes more sense to me now.
cpp/src/parquet/arrow/writer.cc (outdated)
Nice to see this go.
cpp/src/parquet/column_writer.cc (outdated)
I had to look at where this is called to understand what it did. Perhaps rename it to MaterializeDenseFromDictionary?
Renaming to ConvertDictionaryToDense.
cpp/src/parquet/column_writer.cc (outdated)
I think it makes sense to move the call to CheckDictionarySizeLimit into the CommitWriteAndCheckPageLimit function.
We can't have it there because the data page limit logic is needed in the direct dictionary write path, where we don't want to fall back to plain encoding.
Added a comment.
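To make the two paths concrete, here is a self-contained sketch of the separation described above; the function names echo the ones discussed in column_writer.cc, but the bodies are simplified stand-ins rather than the actual writer code.

```cpp
#include <cstdint>
#include <iostream>

constexpr int64_t kLimit = 1024 * 1024;  // default 1MB thresholds

// Stand-in for the page-size rollover check; this runs on both paths.
void CommitWriteAndCheckPageLimit(int64_t buffered_bytes) {
  if (buffered_bytes >= kLimit) std::cout << "flush data page\n";
}

// Stand-in for the dictionary fallback check; conventional path only.
void CheckDictionarySizeLimit(int64_t dict_bytes) {
  if (dict_bytes >= kLimit) std::cout << "fall back to plain encoding\n";
}

void WriteBatchConventional(int64_t buffered_bytes, int64_t dict_bytes) {
  CommitWriteAndCheckPageLimit(buffered_bytes);
  CheckDictionarySizeLimit(dict_bytes);
}

void WriteArrowDictionaryDirect(int64_t buffered_bytes) {
  // Only the data page limit applies here: falling back to plain encoding
  // would discard the caller's dictionary, which must be preserved exactly.
  CommitWriteAndCheckPageLimit(buffered_bytes);
}

int main() {
  WriteBatchConventional(2 * kLimit, 2 * kLimit);
  WriteArrowDictionaryDirect(2 * kLimit);
  return 0;
}
```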
cpp/src/parquet/column_writer.cc (outdated)
I think this should also check for Encoding::RLE_DICTIONARY (used in v2), or possibly reuse this logic from column_reader.cc:

```cpp
// PLAIN_DICTIONARY is deprecated but used to be used as a dictionary index
// encoding.
static bool IsDictionaryIndexEncoding(const Encoding::type& e) {
  return e == Encoding::RLE_DICTIONARY || e == Encoding::PLAIN_DICTIONARY;
}
```
When we create the encoder it's always PLAIN_DICTIONARY. I'll factor out this check into a function anyway, for readability.
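A hypothetical shape for that factored-out check (the name and placement are assumptions, mirroring the reader-side IsDictionaryIndexEncoding quoted above):

```cpp
#include "parquet/types.h"  // for parquet::Encoding

namespace parquet {

// The encoder is currently always created with PLAIN_DICTIONARY, but
// accepting RLE_DICTIONARY as well keeps this symmetric with the
// reader-side check.
static bool IsDictionaryEncoding(Encoding::type encoding) {
  return encoding == Encoding::PLAIN_DICTIONARY ||
         encoding == Encoding::RLE_DICTIONARY;
}

}  // namespace parquet
```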
cpp/src/parquet/column_writer.cc (outdated)
What are the conditions under which we might stop? I thought this would bypass the fallback logic.
Technically you can feed this function a mix of DictionaryArray and non-DictionaryArray chunks and it will encode them happily. I can add a comment clarifying that.
cpp/src/parquet/column_writer.cc (outdated)
The earlier comment makes sense now.
cpp/src/parquet/encoding.cc (outdated)
The body of this implementation appears to be a duplicate of PlainEncoder<ByteArrayType>::Put; could you refactor this into a shared impl?
I'll see what I can do
Done.
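For reference, one shape the shared implementation can take; the helper name and the callback-based design here are illustrative, not necessarily the exact code that landed in encoding.cc.

```cpp
#include <arrow/api.h>
#include <arrow/util/macros.h>

#include <iostream>
#include <string>

// Visit every non-null value of a BinaryArray. Both the plain and the
// dictionary encoders can share this traversal and differ only in the
// callback they pass for each value.
template <typename PutValue>
void VisitNonNullValues(const arrow::BinaryArray& values, PutValue put) {
  for (int64_t i = 0; i < values.length(); ++i) {
    if (!values.IsNull(i)) put(values.GetView(i));
  }
}

int main() {
  arrow::StringBuilder builder;
  ARROW_UNUSED(builder.Append("foo"));
  ARROW_UNUSED(builder.AppendNull());
  ARROW_UNUSED(builder.Append("bar"));
  std::shared_ptr<arrow::Array> array;
  ARROW_UNUSED(builder.Finish(&array));
  // Prints "foo" and "bar", skipping the null slot.
  VisitNonNullValues(static_cast<const arrow::BinaryArray&>(*array),
                     [](arrow::util::string_view view) {
                       std::cout << std::string(view) << "\n";
                     });
  return 0;
}
```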
hatemhelal left a comment:
Changes look good, thanks for the careful explanation and looking forward to trying this out!
See before and after benchmarks: https://gist.github.com/wesm/b9e917b46536676d2ed829bc4458ba3d. Summary, for this synthetic example:
The Appveyor failure in the R build seems legitimate, which is annoying: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26723020/job/w459ys9kd4838qlt
You might have the same issue here: https://github.com/apache/arrow/pull/5077/files#diff-66bf5c14b0dc83dceaebb3fe0ddbb55cR567
Could you write (or point to an existing) test which is identical but doesn't set store_schema?
Yes, thanks for pointing this out; it definitely makes the test "stronger".
Could you add a comment summarizing this function?
It's a doozy. I'll add some more comments throughout.
cpp/src/parquet/statistics.h (outdated)
inline is redundant for templates
Suggested change:

```diff
- inline std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(
+ std::shared_ptr<typename TypeClasses<DType>::Statistics> MakeStatistics(
```
Noted. Is this also true for out-of-class implementations of template class methods?
IIRC templates are always inline unless explicitly marked with extern.
Hm, there seems to be mixed information about this: https://stackoverflow.com/a/10536588/776560. I'm curious to know all the different cases, but I'll remove the inline here.
Fixed
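A minimal, self-contained illustration of the rule in question (not from the patch): a function template defined in a header can be included from many translation units without violating the one-definition rule, which is why the `inline` keyword adds nothing there.

```cpp
// twice.h -- safe to include from many .cc files even though the template's
// definition lives here; template definitions are exempt from the
// one-definition rule the way inline functions are, so marking Twice()
// `inline` would be redundant.
#pragma once

template <typename T>
T Twice(T value) {
  return value + value;
}

// By contrast, an ordinary (non-template) function defined in a header does
// need `inline` to avoid multiple-definition link errors:
inline int TwiceInt(int value) { return value + value; }
```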
cpp/src/parquet/types.h (outdated)
Suggested change:

```diff
- struct TypeClasses {};
+ struct StatisticsTraits;
```
I intended this to provide mappings to other classes, too. I'm not sure it's needed at all now, since I removed some of the subclasses that made me add this in the first place. Will take a look.
I completely removed this type traits class; it was only needed in an intermediate state of this patch.
cpp/src/parquet/encoding.cc (outdated)
Suggested change:

```diff
- if (dynamic_cast<const arrow::BinaryArray*>(&values) == nullptr) {
+ if (values->type_id() == Type::BINARY || values->type_id() == Type::STRING) {
```
Done.
cpp/src/parquet/arrow/writer.cc (outdated)
Suggested change:

```diff
  // primitive type
+ if (array.dict_type()->index_type()->id() != Type::INT32) {
+   return Status::TypeError("Writing DictionaryArray with non int32 indices");
+ }
```
Actually, any signed integer index type is supported.
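For example, all of these describe dictionary columns the writer accepts; the types are illustrative, built with the standard arrow::dictionary() factory:

```cpp
#include <arrow/api.h>

#include <iostream>

int main() {
  // Any signed integer index type works, not just int32.
  for (const auto& index_type :
       {arrow::int8(), arrow::int16(), arrow::int32(), arrow::int64()}) {
    std::cout << arrow::dictionary(index_type, arrow::utf8())->ToString()
              << std::endl;
  }
  return 0;
}
```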
cpp/src/parquet/column_writer.cc (outdated)
I don't see a test that checks this last path; could you add one?
It's there:
```console
$ gdb --args debug/parquet-arrow-test --gtest_filter=*ChangingDictionaries
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from debug/parquet-arrow-test...done.
(gdb) b src/parquet/column_writer.cc:1055
No source file named src/parquet/column_writer.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/parquet/column_writer.cc:1055) pending.
(gdb) r
Starting program: /home/wesm/code/arrow/cpp/build/debug/parquet-arrow-test --gtest_filter=\*ChangingDictionaries
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from /home/conda/feedstock_root/build_artifacts/gtest_1551008230529/work/googletest/src/gtest_main.cc
Note: Google Test filter = *ChangingDictionaries
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestArrowWriteDictionaries
[ RUN      ] TestArrowWriteDictionaries.ChangingDictionaries
Breakpoint 1, parquet::TypedColumnWriterImpl<parquet::PhysicalType<(parquet::Type::type)6> >::WriteArrowDictionary (this=0x61400005b850, def_levels=0x7fffeeb9ac00, rep_levels=0x0, num_levels=50000, array=...,
    ctx=0x612000048b98) at ../src/parquet/column_writer.cc:1055
1055          PARQUET_CATCH_NOT_OK(FallbackToPlainEncoding());
```
Thank you for the code review!
Will merge this as soon as the build is passing...
With #5077 (or possibly #4949), behavior with dictionary arrays changed, leaving the explicit call to DictionaryEncode() redundant. @wesm

Closes #5299 from bkietz/6434-Crossbow-Nightly-HDFS-int and squashes the following commits:

b29e6b7 <Benjamin Kietzman> don't try to dictionary encode dictionary arrays

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`. The R binding is already prepared for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it this way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
There are a lot of interconnected pieces in this patch, so let me try to unpack them:
- Refactor `TypedColumnWriterImpl::WriteBatch`/`WriteBatchSpaced` to utilize more common code and be more readable
- Add `TypedEncoder<T>::Put(const arrow::Array&)` and implement it for BYTE_ARRAY, to avoid having to first create `ByteArray*` as before. This should improve write performance for regular binary data -- I will do some benchmarks to measure by how much
- Add `TypedStatistics<T>::Update(const arrow::Array&)` and implement it for BYTE_ARRAY. This is necessary to be able to update the statistics given directly-inserted Arrow data, without serialization
- Add `PutDictionary` and `PutIndices` methods on `DictEncoder`. `PutDictionary` is only implemented for BYTE_ARRAY but can be easily generalized to more types (we should open a follow-up JIRA for this)
- Add `TypedColumnWriterImpl::WriteArrowDictionary`, which writes dictionary values and indices directly into a `DictEncoder`. This circumvents the dictionary page size checks, so that we will continue to call `PutIndices` until a new dictionary is encountered or some non-dictionary data is written. Note that in master, dictionary encoding is turned off as soon as a threshold in dictionary size is reached, which is by default 1MB. So if you want to preserve exactly the original dictionary values (e.g. if you are roundtripping DictionaryArray, R factor, or pandas.Categorical), then we have to step around this threshold check in this narrow case.
- Add an `ArrowWriterProperties::store_schema()` option which stores the Arrow schema used to create a Parquet file in a special `ARROW:schema` key in the metadata, so that we can detect that a column was originally a DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future. (A usage sketch follows at the end of this description.)

I think that's most things. One end result of this is that `arrow::DictionaryArray` types from C++ and `pandas.Categorical` types coerced from pandas with string dictionary values will be accurately preserved end-to-end. With a little more work (which I think can be done in a follow-up PR) we can support the rest of the Parquet physical types.

This was a fairly ugly project and I've doubtlessly left some messes around that we should clean up, but perhaps in follow-up patches.
I'll post some benchmarks later to assess the improvements in read and write performance. In the case of writing dictionary-encoded strings, the boost should be significant.
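To make the `store_schema()` option concrete, here is a minimal C++ usage sketch (using the Status-returning `FileOutputStream::Open` of this era; the path and chunk size are arbitrary):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Write `table` with the Arrow schema embedded under the ARROW:schema
// metadata key, so that DictionaryArray columns round-trip as dictionaries.
arrow::Status WriteWithStoredSchema(const std::shared_ptr<arrow::Table>& table,
                                    const std::string& path) {
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &sink));
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024,
                                    parquet::default_writer_properties(),
                                    arrow_props);
}
```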