Skip to content

Conversation

@Kontinuation
Copy link
Member

@Kontinuation Kontinuation commented Feb 20, 2025

This adds 2 primitive types to iceberg-api and iceberg-core for supporting geospatial data types, partially implementing the iceberg geo spec: #10981

The newly added primitive types are:

  • geometry(C): geometries with linear/planar edge interpolation
  • geography(C, A): geometries with non-linear edge interpolation, the algorithm for interpolating edges is parameterized by A.

@jiayuasu
Copy link
Member

@rdblue @szehon-ho @flyrain please review when you have time 🙏🏻

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey , really sorry for delay, let's work on it this week. Some early comments


private final Geometry geometry;

public Geography(Geometry geometry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need interop between Geography and Geometry? I assume we have just have a class for each.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We define Geography this way to reuse the data structures of coordinates and geospatial objects (Points, LineString, etc.), as well as WKB/WKT functions provided by JTS. Geography does not interoperate with Geometry, they are different classes but have the same data structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a base class, and two implementing classes then? its confusing as it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try removing Geometry and Geography from iceberg-api, and use ByteBuffer as the underlying Java class for these types.

return String.valueOf(value).startsWith((String) literal.value());
case NOT_STARTS_WITH:
return !String.valueOf(value).startsWith((String) literal.value());
case ST_INTERSECTS:
Copy link
Member

@szehon-ho szehon-ho Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets split out predicate pruning support. I feel 80% of the changes is to support those, let's focus in this pr to get the API right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Predicates and pruning were removed from this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

build.gradle Outdated

dependencies {
implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
api libs.jts.core
Copy link
Member

@szehon-ho szehon-ho Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not comfortable leaking the JTS API from Iceberg API, as this forces all the consumer to now depend on JTS. Can we instead encapsulate this dep in iceberg-core? Going to check with other Iceberg PMC's on this.

At very least this should be implementation.

if (primitive.typeId() == Type.TypeID.GEOMETRY) {
Types.GeometryType geometryType = (Types.GeometryType) primitive;
generator.writeStartObject();
generator.writeStringField("type", "geometry");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: use the constant for "type"

Copy link
Member Author

@Kontinuation Kontinuation Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced constant literals such as "type", "geometry", "geography", "crs" and "algorithm" with constants.


static void toJson(Type.PrimitiveType primitive, JsonGenerator generator) throws IOException {
generator.writeString(primitive.toString());
if (primitive.typeId() == Type.TypeID.GEOMETRY) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: use switch statement like rest of the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to use switch-case.

generator.writeStringField("type", "geometry");
String crs = geometryType.crs();
if (!crs.isEmpty()) {
generator.writeStringField("crs", geometryType.crs());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "crs" seems used frequently enough to warrant a constant.

return TYPES.get(lowerTypeString);
}

if (lowerTypeString.startsWith("geometry")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cant we put the startsWith in the regex, like DECIMAL and FIXED

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot match the regex with lowerTypeString, because we want to extract the original CRS from the type instead of a lower case one. I'll see how to make it more concise and get rid of startsWith.

Copy link
Contributor

@rdblue rdblue Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we consider CRS to be case sensitive?

The capture group of a case insensitive regular expression returns the matched characters without modification. So if CRS is case sensitive, you can still use a regular expression, like this:

    Pattern pattern = Pattern.compile("geometry\\(([^)]*)\\)", Pattern.CASE_INSENSITIVE);

This test will pass:

    Matcher m = pattern.matcher("GEOMETRY(aBc)");
    assertThat(m.matches()).isTrue();
    assertThat(m.group(1)).isEqualTo("aBc");

I'd prefer using a case insensitive pattern on typeString like this, rather than startsWith and substring with a hard-coded length.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using case insensitive regex pattern matching.

return new Literals.GeometryLiteral(value);
}

static Literal<Geography> of(Geography value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess here, let's specifically not leak the Geography in the API, we should just have a wrapper class as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove GeometryLiteral ad GeographyLiteral. The iceberg expressions should only work with bounding boxes, so there's no need to involve full-fledged Geometry and Geography objects.


private final Geometry geometry;

public Geography(Geometry geometry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a base class, and two implementing classes then? its confusing as it is.

}

public static GeographyType of(String crs) {
return of(crs, DEFAULT_ALGORITHM);
Copy link
Member

@szehon-ho szehon-ho Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option is to allow null for algorithm right? and the code can default it , as per the spec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The algorithm field is null when no algorithm is specified.

@Kontinuation
Copy link
Member Author

Kontinuation commented Mar 5, 2025

I have removed dependency to JTS from iceberg-api, now the underlying Java type of Geometry and Geography are ByteBuffer. Now JTS is a implementation dependency of iceberg-core.

I have removed code for converting between ByteBuffer and geospatial types, removed GeometryLiteral/GeographyLiteral, as well as default initial/write value support for geospatial fields. I'll proceed adding them once we are all agree on the API of primitive geospatial types.

class Identity<T> implements Transform<T, T> {
private static final Identity<?> INSTANCE = new Identity<>();

private static final Set<Type.TypeID> UNSUPPORTED_TYPES =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

public static GeometryType of(String crs) {
return new GeometryType(crs == null ? "" : crs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Maybe crs == null can be crs == null || crs.isEmpty()

Copy link
Member Author

@Kontinuation Kontinuation Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believed that including crs.isEmpty() was redundant in this situation, so I didn't add it, as we would return an empty string regardless.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty string should not be allowed. That is an invalid CRS because it has no information. Null is fine to support here if you expect to have situations in which callers will find it convenient to pass null to mean "use the default".

// We need to set outputDimension = 4 for XYM geometries to make JTS WKTWriter or WKBWriter work
// correctly.
// The WKB/WKT writers will ignore Z ordinate for XYM geometries.
if (!Double.isNaN(coordinate.getZ())) {
Copy link
Contributor

@hsiang-c hsiang-c Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could do (not a must).

if (Coordinate.NULL_ORDINATE != coordinate.getZ())

Copy link
Member Author

@Kontinuation Kontinuation Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that we cannot do this. Coordinate.NULL_ORDINATE is NaN, comparing it with other floating point number (including NaN) using != will always yield true.

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;

public class TestGeometryUtil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you covered Coordinate and its subclasses except ExtendedCoordinate, am I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As far as I know, ExtendedCoordinate is part of JTS example code so we don't need to handle it. JTS will only be used internally in iceberg and it exchanges geospatial data with other systems through WKB/Bounds byte buffer, so there's no need to take non-standard extensions of Coordinate into consideration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

Copy link
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Kontinuation for removing the api dependency on jts from iceberg-api, this looks a lot closer to me

}

public static EdgeInterpolationAlgorithm fromName(String algorithmName) {
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precondition, check null and throw exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added precondition with error message: Invalid edge interpolation algorithm: null.

return EdgeInterpolationAlgorithm.valueOf(algorithmName.toUpperCase(Locale.ENGLISH));
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException(
String.format("Invalid edge interpolation algorithm name: %s", algorithmName), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can remove redundant 'name'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now the error message is Invalid edge interpolation algorithm: %s

}

public static GeometryType get() {
return of("");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not , new GeometryType("")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now we are using null as default value as requested. For now it is new GeometryType(null).

}

public static GeographyType get() {
return of("");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not new GeographyType("")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now it is new GeographyType(null, null).

.isThrownBy(() -> required("field").ofType(Types.StringType.get()).build())
.withMessage("Id cannot be null");

assertThat(Types.fromPrimitiveString("geometry")).isEqualTo(Types.GeometryType.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in its own test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this suite, these cases would be added to fromPrimitiveString and fromTypeName.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved tests to fromTypeName and fromPrimitiveString.

}

private static Types.GeometryType geometryFromJson(JsonNode json) {
String crs = JsonUtil.getStringOrNull("crs", json);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these (and following) can be replaced by the constants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reverted since we are not encoding geospatial types as JSON.

private final String crs;

private GeometryType(String crs) {
this.crs = crs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precondition that crs is not null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to allow null to indicate the default. That's easier to detect for toString to show geometry instead of geometry().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow null and not allow empty string for CRS. Additionally we check that the CRS should not have leading or trailing spaces to make sure that the value of CRS will be persisted after converting to string and parsing it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why not always have toString show the default value like Geography? Geometry("OGC:CRS84"). Is it to save some bytes ? Seems to be easier imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geography does not always show the default CRS. It happens when the edge algorithm is not using default value but CRS is left unset. We have to fill in the default CRS as the first argument in this case.

I thought that the behavior of toString is to omit optional parameters when possible. See https://github.com/apache/iceberg/pull/12346/files/0c451cac53f4098680aa0ae518af70dc45efaeb1#r1994430054.

}

private static int getOutputDimension(Geometry geom) {
int dimension = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make a constant to avoid instantiating it every time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is removed as we don't need WKB/WKT serialization according to the latest spec.

@szehon-ho
Copy link
Member

Also, (as can't comment on files that are not in the change)

Do we need to add Geo types to following places?

  1. Types.java TYPES constant?
  2. TestSchemaUnionByFieldName primitiveTypes()?
  3. TestSchema TEST_TYPES?
  4. TestTypesUtil testTypes()?

case FIXED:
case BINARY:
case GEOMETRY:
case GEOGRAPHY:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this included? I don't think that we want to translate these into Hive types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted this change. Does it mean that Hive catalog won't support geospatial columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this is needed, we can add it. But there's no value in translating to binary right now.

Matcher geometry = GEOMETRY_PARAMETERS.matcher(typeString);
if (geometry.matches()) {
String crs = geometry.group(1);
Preconditions.checkArgument(!crs.contains(","), "Invalid CRS: %s", crs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this invalid? This should pass the string as-is. It is up to others to determine what is a valid CRS.

Copy link
Member Author

@Kontinuation Kontinuation Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comma in CRS will disrupt the parsing of geography types such as geography(crs,with,comma, spherical), so I treat crs containing comma as invalid to avoid ambiguity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this check and pass the CRS string as-is, since it won't cause ambiguity for geometry types.

}

public String crs() {
return crs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CRS is null, should this return DEFAULT_CRS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to return DEFAULT_CRS, so that the callers don't have to handle both nulls and default CRS.

We can make GeographyType.algorithm return EdgeAlgorithm.SPHERICAL when algorithm is null as well.

@rdblue rdblue merged commit e1e0a74 into apache:main Mar 25, 2025
43 checks passed
@rdblue
Copy link
Contributor

rdblue commented Mar 25, 2025

Looks great! There are some minor nits, but I think it makes sense to commit this to move forward. Thanks @Kontinuation and @szehon-ho!

@szehon-ho
Copy link
Member

Thanks , great work @Kontinuation , and thanks @rdblue !

Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Mar 28, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Mar 28, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Apr 21, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request May 21, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Sep 17, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Sep 18, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Sep 22, 2025
Kontinuation added a commit to Kontinuation/iceberg that referenced this pull request Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants