-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-43807: [C++][Python] Add UUID extension type conversion support to/from Parquet #45866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
149bbe8 to
300f5eb
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ part looks good to me (I can't speak for the python part though). Thanks!
cpp/src/parquet/arrow/schema.cc
Outdated
| // Apply metadata recursively to storage type | ||
| RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred)); | ||
| inferred->field = inferred->field->WithType(origin_type); | ||
| } else if (inferred_type->id() == ::arrow::Type::FIXED_SIZE_BINARY && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These branches are growing longer while they look pretty similar. Not sure if we can refactor it a little bit to look nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent quite a bit of time rewriting the logic here...I think it's better than it was?
|
|
||
| { | ||
| // Parquet file contains Arrow schema. | ||
| // uuid will be interpreted as uuid() field even though extensions are not enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are below.
| return ::arrow::fixed_size_binary(physical_length); | ||
| case LogicalType::Type::UUID: | ||
| if (reader_properties.get_arrow_extensions_enabled()) { | ||
| return ::arrow::extension::uuid(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check physical_length here?
python/pyarrow/_dataset_parquet.pyx
Outdated
| page_checksum_verification : bool, default False | ||
| If True, verify the page checksum for each page read from the file. | ||
| arrow_extensions_enabled : bool, default False | ||
| If True, read Parquet logical types as Arrow Extension Types where possible, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments I posted on this on the Geometry PR.
| pa.table({"ext": pa.array(data, pa.string())}), | ||
| store_schema=False) | ||
|
|
||
| # With arrow_extensions_enabled=True on read, we get a arrow.uuid back |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This the JSON test, not UUID.
|
Thanks for the reminder! I see the MacOS 13 tests failing on main as well, but I think this is ready for another review whenever time allows 🙂 |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! I have left some comments to the refactoring part in schema.cc. Other parts look good to me!
cpp/src/parquet/arrow/schema.cc
Outdated
| } else if (origin_extension_name == "arrow.uuid") { | ||
| extension_supports_inferred_storage = | ||
| arrow_extension_inferred || | ||
| (inferred_type->id() == ::arrow::Type::FIXED_SIZE_BINARY && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to add UuidType::IsSupportedStorageType(const std::shared_ptr<::arrow::DataType>&) for this?
cpp/src/parquet/arrow/schema.cc
Outdated
| // i.e., arrow_extensions_enabled is true or arrow_extensions_enabled is false but | ||
| // we still restore the extension type because Arrow is the source of truth if we | ||
| // are asked to apply the original metadata | ||
| auto origin_storage_field = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto origin_storage_field =
origin_field.WithType(origin_extension_type.storage_type());
RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred));
These lines are the same in the branches. Should we move them out?
cpp/src/parquet/arrow/schema.cc
Outdated
| arrow_extension_inferred || | ||
| VariantExtensionType::IsSupportedStorageType(inferred_type); | ||
| } else if (arrow_extension_inferred) { | ||
| extension_supports_inferred_storage = origin_extension_type.Equals(*inferred_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this branch? It will for sure fall into the if statement at line 1100.
| } else if (arrow_extension_inferred) { | ||
| extension_supports_inferred_storage = origin_extension_type.Equals(*inferred_type); | ||
| } else { | ||
| extension_supports_inferred_storage = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above comment, do we need to check this? It will go to the else branch at line 1108 and the same check is performed there too.
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
… both if branches
|
Apologies for taking a week to circle back here...I think the main thing I had been missing was that |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great to me! Thanks!
|
Thanks! I'll merge tomorrow if there are no objections! |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 75acf37. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
I forgot, do we have an issue open to enable extension types by default? |
|
I didn't find one through a quick search. Should I create one? You mean |
|
We did #44070 (comment), here's the corresponding issue: #44500. I'm happy to open a PR later this week. |
Rationale for this change
After #37298, the UUID canonical extension type is supported in Arrow C++ and PyArrow; however, it is not converted to the Parquet UUID type on write and is not inferred when on Parquet read.
What changes are included in this PR?
Are these changes tested?
Yes!
Are there any user-facing changes?
Yes! (Documentation will be added)