Conversation

@paleolimbot (Member) commented Jun 10, 2025

### Rationale for this change

The Parquet C++ implementation now supports reading four logical types (JSON, UUID, Geometry, Geography) as Arrow extension types; however, users have to opt in to avoid losing the logical type on read.

### What changes are included in this PR?

This PR sets the default value of `arrow_extensions_enabled` to `True` (in Python).

### Are these changes tested?

Yes, the behaviour of `arrow_extensions_enabled` was already tested (and tests were updated to reflect the new default value).

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**

Parquet files that contain a JSON or UUID logical type will now be read as an extension type rather than as string or fixed-size binary, respectively. Python users who were relying on the previous behaviour will have to explicitly cast to the storage type or use `read_table(..., arrow_extensions_enabled=False)` after this PR:

```python
import uuid
import pyarrow as pa

json_array = pa.array(['{"k": "v"}'], pa.json_())
json_array.cast(pa.string())
#> [
#>   "{"k": "v"}"
#> ]

uuid_array = pa.array([uuid.uuid4().bytes], pa.uuid())
uuid_array.cast(pa.binary(16))
#> <pyarrow.lib.FixedSizeBinaryArray object at 0x11e42b1c0>
#> [
#>   746C1022AB434A97972E1707EC3EE8F4
#> ]
```
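For readers who prefer to opt out at the source rather than cast afterwards, here is a minimal sketch of the `arrow_extensions_enabled=False` path described above. The file name and column name are illustrative, and the comments describe the types the new default is expected to produce rather than captured output:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a table with a JSON-typed column (illustrative file name).
table = pa.table({"payload": pa.array(['{"k": "v"}'], pa.json_())})
pq.write_table(table, "example.parquet")

# New default: the JSON logical type comes back as the Arrow JSON
# extension type on read.
pq.read_table("example.parquet").schema.field("payload").type

# Opting out restores the previous behaviour and yields the plain
# storage type instead.
pq.read_table(
    "example.parquet", arrow_extensions_enabled=False
).schema.field("payload").type
```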

@github-actions

⚠️ GitHub issue #44500 has been automatically assigned in GitHub to PR creator.

@paleolimbot marked this pull request as ready for review June 11, 2025 03:31
@AlenkaF (Member) left a comment

Hi @paleolimbot, thanks for all the work connected to the extension types!

I think the change in behaviour makes sense and the opinions on the issue are all in favour, so I will go ahead and merge today/tomorrow if there are no further comments.

@AlenkaF (Member) commented Jun 17, 2025

@github-actions crossbow submit -g python

@github-actions

Revision: c6c3236

Submitted crossbow builds: ursacomputing/crossbow @ actions-3bbbbaa1af

| Task | Status |
|---|---|
| example-python-minimal-build-fedora-conda | GitHub Actions |
| example-python-minimal-build-ubuntu-venv | GitHub Actions |
| test-conda-python-3.10 | GitHub Actions |
| test-conda-python-3.10-hdfs-2.9.2 | GitHub Actions |
| test-conda-python-3.10-hdfs-3.2.1 | GitHub Actions |
| test-conda-python-3.10-pandas-latest-numpy-latest | GitHub Actions |
| test-conda-python-3.11 | GitHub Actions |
| test-conda-python-3.11-dask-latest | GitHub Actions |
| test-conda-python-3.11-dask-upstream_devel | GitHub Actions |
| test-conda-python-3.11-hypothesis | GitHub Actions |
| test-conda-python-3.11-pandas-latest-numpy-1.26 | GitHub Actions |
| test-conda-python-3.11-pandas-latest-numpy-latest | GitHub Actions |
| test-conda-python-3.11-pandas-nightly-numpy-nightly | GitHub Actions |
| test-conda-python-3.11-pandas-upstream_devel-numpy-nightly | GitHub Actions |
| test-conda-python-3.11-spark-master | GitHub Actions |
| test-conda-python-3.12 | GitHub Actions |
| test-conda-python-3.12-cpython-debug | GitHub Actions |
| test-conda-python-3.13 | GitHub Actions |
| test-conda-python-3.9 | GitHub Actions |
| test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 | GitHub Actions |
| test-conda-python-emscripten | GitHub Actions |
| test-cuda-python-ubuntu-22.04-cuda-11.7.1 | GitHub Actions |
| test-debian-12-python-3-amd64 | GitHub Actions |
| test-debian-12-python-3-i386 | GitHub Actions |
| test-fedora-39-python-3 | GitHub Actions |
| test-ubuntu-22.04-python-3 | GitHub Actions |
| test-ubuntu-22.04-python-313-freethreading | GitHub Actions |
| test-ubuntu-24.04-python-3 | GitHub Actions |

@raulcd (Member) left a comment

Thanks @paleolimbot, LGTM

@github-actions bot added the awaiting merge label and removed the awaiting committer review label Jun 17, 2025
@AlenkaF merged commit 639201b into apache:main Jun 17, 2025
15 of 17 checks passed
@AlenkaF removed the awaiting merge label Jun 17, 2025
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 639201b.

There were 123 benchmark results with an error.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

alinaliBQ pushed a commit to Bit-Quill/arrow that referenced this pull request Jun 17, 2025
…extension types by default (apache#46772)

* GitHub Issue: apache#44500

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: AlenkaF <[email protected]>