Member

@paleolimbot paleolimbot commented Mar 20, 2025

Rationale for this change

After #37298, the UUID canonical extension type is supported in Arrow C++ and PyArrow; however, it is not converted to the Parquet UUID type on write and is not inferred on Parquet read.

What changes are included in this PR?

  • Infer the Parquet UUID type from the Arrow UUID extension type
  • Infer the Arrow UUID type from Parquet UUID when arrow_extensions_enabled is set (like the JSON extension type)
  • Wire up arrow_extensions_enabled to pyarrow and add Python tests
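
For context on the storage the two types share: the Parquet UUID logical type annotates a 16-byte fixed-length byte array, and the Arrow UUID extension type uses 16-byte fixed-size binary storage. Below is a minimal standard-library sketch of that byte-level round trip. The helper names `to_storage` and `from_storage` are made up for illustration and are not part of Arrow or PyArrow.

```python
import uuid

UUID_BYTE_WIDTH = 16  # width of a UUID in both Parquet and Arrow storage


def to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes that would be stored for a UUID."""
    return value.bytes


def from_storage(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 bytes of storage; reject any other width."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
stored = to_storage(original)      # what a fixed-size binary column would hold
restored = from_storage(stored)    # what a reader would hand back
```

Because the bytes are identical on both sides, converting between the two types is a matter of relabelling the column rather than rewriting its values.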

Are these changes tested?

Yes!

Are there any user-facing changes?

Yes! (Documentation will be added)

@github-actions

⚠️ GitHub issue #43807 has been automatically assigned in GitHub to PR creator.

@paleolimbot paleolimbot marked this pull request as ready for review March 25, 2025 14:47
@paleolimbot paleolimbot requested a review from wgtmac as a code owner March 25, 2025 14:47
Member

@wgtmac wgtmac left a comment

The C++ part looks good to me (I can't speak for the python part though). Thanks!

// Apply metadata recursively to storage type
RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred));
inferred->field = inferred->field->WithType(origin_type);
} else if (inferred_type->id() == ::arrow::Type::FIXED_SIZE_BINARY &&
Member

These branches are growing longer while they look pretty similar. Not sure if we can refactor it a little bit to look nicer.

Member Author

I spent quite a bit of time rewriting the logic here...I think it's better than it was?


{
// Parquet file contains Arrow schema.
// uuid will be interpreted as uuid() field even though extensions are not enabled.
Member

They are below.

return ::arrow::fixed_size_binary(physical_length);
case LogicalType::Type::UUID:
if (reader_properties.get_arrow_extensions_enabled()) {
return ::arrow::extension::uuid();
Member

Do we need to check physical_length here?
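
One way to picture the question is the sketch below, which models the read-side decision in plain Python with a hypothetical guard on the physical length. It is not the code in this PR: `infer_arrow_type` and its string return values are invented for illustration, falling back to the storage type on a width mismatch is an assumption, and whether the merged C++ performs such a check is not shown in this thread.

```python
UUID_BYTE_WIDTH = 16  # a UUID is always 16 bytes


def infer_arrow_type(logical_type: str, physical_length: int,
                     arrow_extensions_enabled: bool) -> str:
    """Toy model of mapping a Parquet fixed-length column to an Arrow type name."""
    if (logical_type == "UUID" and arrow_extensions_enabled
            and physical_length == UUID_BYTE_WIDTH):
        # Only a 16-byte column can back the UUID extension type.
        return "extension<arrow.uuid>"
    # Otherwise fall back to the plain storage type.
    return f"fixed_size_binary[{physical_length}]"
```

With the guard in place, a column annotated as UUID but declared with a different width would be read as plain fixed-size binary instead of producing an extension type whose storage does not match.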

page_checksum_verification : bool, default False
If True, verify the page checksum for each page read from the file.
arrow_extensions_enabled : bool, default False
If True, read Parquet logical types as Arrow Extension Types where possible,
Member

Please see comments I posted on this on the Geometry PR.

pa.table({"ext": pa.array(data, pa.string())}),
store_schema=False)

# With arrow_extensions_enabled=True on read, we get a arrow.uuid back
Member

This is the JSON test, not UUID.

@github-actions github-actions bot removed the "awaiting committer review" label Apr 1, 2025
@paleolimbot
Member Author

Thanks for the reminder! I see the MacOS 13 tests failing on main as well, but I think this is ready for another review whenever time allows 🙂

Member

@wgtmac wgtmac left a comment

Thanks for the update! I have left some comments on the refactoring part in schema.cc. Other parts look good to me!

} else if (origin_extension_name == "arrow.uuid") {
extension_supports_inferred_storage =
arrow_extension_inferred ||
(inferred_type->id() == ::arrow::Type::FIXED_SIZE_BINARY &&
Member

Is it better to add UuidType::IsSupportedStorageType(const std::shared_ptr<::arrow::DataType>&) for this?

// i.e., arrow_extensions_enabled is true or arrow_extensions_enabled is false but
// we still restore the extension type because Arrow is the source of truth if we
// are asked to apply the original metadata
auto origin_storage_field =
Member

      auto origin_storage_field =
          origin_field.WithType(origin_extension_type.storage_type());
      RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred));

These lines are the same in the branches. Should we move them out?

arrow_extension_inferred ||
VariantExtensionType::IsSupportedStorageType(inferred_type);
} else if (arrow_extension_inferred) {
extension_supports_inferred_storage = origin_extension_type.Equals(*inferred_type);
Member

Why do we need this branch? It will for sure fall into the if statement at line 1100.

} else if (arrow_extension_inferred) {
extension_supports_inferred_storage = origin_extension_type.Equals(*inferred_type);
} else {
extension_supports_inferred_storage =
Member

Similar to the above comment, do we need to check this? It will go to the else branch at line 1108 and the same check is performed there too.

@AlenkaF AlenkaF removed their request for review April 18, 2025 17:45
@paleolimbot
Member Author

Apologies for taking a week to circle back here...I think the main thing I had been missing was that ApplyOriginalStorageMetadata(*origin_storage_field, inferred) invalidated auto& inferred_type, so I had been getting crashes when I tried to make the logic simpler. I think the result is much cleaner! I tried to overload the comments with the things I learned in the process.
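
The pitfall described here, holding on to the inferred type while a call replaces the field it came from, has a rough Python analogue. In C++ the stale reference can dangle and crash; in Python it merely keeps pointing at the old value, but the remedy is the same: re-read the type after the call. Everything in this sketch (`Field`, `Inferred`, `apply_original_metadata`) is a made-up stand-in and not Arrow code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str


@dataclass
class Inferred:
    field: Field


def apply_original_metadata(inferred: Inferred, new_type: str) -> None:
    """Replace the field with a copy carrying a new type, similar in spirit to WithType()."""
    inferred.field = Field(inferred.field.name, new_type)


inferred = Inferred(Field("id", "fixed_size_binary[16]"))
stale_type = inferred.field.type   # captured before the call
apply_original_metadata(inferred, "extension<arrow.uuid>")
fresh_type = inferred.field.type   # re-read after the call
```

Here `stale_type` still describes the field as it was before the call, which is the kind of mismatch that makes a simplified branch misbehave.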

Member

@wgtmac wgtmac left a comment

It looks great to me! Thanks!

@paleolimbot
Member Author

Thanks! I'll merge tomorrow if there are no objections!

@paleolimbot paleolimbot merged commit 75acf37 into apache:main Apr 21, 2025
37 checks passed
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 75acf37.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

Member

pitrou commented Apr 22, 2025

I forgot, do we have an issue open to enable extension types by default?

Member

wgtmac commented Apr 22, 2025

I didn't find one through a quick search. Should I create one? You mean ArrowReaderProperties::arrow_extensions_enabled_, right? @pitrou

Member

pitrou commented Apr 22, 2025

> I didn't find one through a quick search. Should I create one? You mean ArrowReaderProperties::arrow_extensions_enabled_, right? @pitrou

Yes, and its Python counterpart. I think we discussed enabling it by default before @rok ?

Member

rok commented Apr 22, 2025

We did (see #44070 (comment)); here's the corresponding issue: #44500. I'm happy to open a PR later this week.
