-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-44500: [Python][Parquet] Map Parquet logical types to Arrow extension types by default #46772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-44500: [Python][Parquet] Map Parquet logical types to Arrow extension types by default #46772
Conversation
|
|
AlenkaF
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @paleolimbot, thanks for all the work connected to the extension types!
I think the change in the behaviour makes sense and the opinions on the issue are all in favour, so I will go ahead and merge today/tomorrow if there will be no other comments.
|
@github-actions crossbow submit -g python |
|
Revision: c6c3236 Submitted crossbow builds: ursacomputing/crossbow @ actions-3bbbbaa1af |
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paleolimbot , LGTM
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 639201b. There were 123 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…extension types by default (apache#46772) ### Rationale for this change The Parquet C++ implementation now supports reading four logical types (JSON, UUID, Geometry, Geography) as Arrow extension types; however, users have to opt-in to avoid loosing the logical type on read. ### What changes are included in this PR? This PR sets the default value of `arrow_extensions_enabled` to `True` (in Python). ### Are these changes tested? Yes, the behaviour of `arrow_extensions_enabled` was already tested (and tests were updated to reflect the new default value). ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Reading Parquet files that contained a JSON or UUID logical type will now have an extension type rather than string or fixed size binary, respectively. Python users that were relying on the previous behaviour would have to explicitly cast to storage or use `read_table(..., arrow_extensions_enabled=False)` after this PR: ```python import uuid import pyarrow as pa json_array = pa.array(['{"k": "v"}'], pa.json_()) json_array.cast(pa.string()) #> [ #> "{"k": "v"}" #> ] uuid_array = pa.array([uuid.uuid4().bytes], pa.uuid()) uuid_array.cast(pa.binary(16)) #> <pyarrow.lib.FixedSizeBinaryArray object at 0x11e42b1c0> #> [ #> 746C1022AB434A97972E1707EC3EE8F4 #> ] ``` * GitHub Issue: apache#44500 Authored-by: Dewey Dunnington <[email protected]> Signed-off-by: AlenkaF <[email protected]>
Rationale for this change
The Parquet C++ implementation now supports reading four logical types (JSON, UUID, Geometry, Geography) as Arrow extension types; however, users have to opt-in to avoid loosing the logical type on read.
What changes are included in this PR?
This PR sets the default value of
arrow_extensions_enabledtoTrue(in Python).Are these changes tested?
Yes, the behaviour of
arrow_extensions_enabledwas already tested (and tests were updated to reflect the new default value).Are there any user-facing changes?
This PR includes breaking changes to public APIs.
Reading Parquet files that contained a JSON or UUID logical type will now have an extension type rather than string or fixed size binary, respectively. Python users that were relying on the previous behaviour would have to explicitly cast to storage or use
read_table(..., arrow_extensions_enabled=False)after this PR: