-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API, Core: Add geometry and geography types support #12346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b39e32c to
904f503
Compare
904f503 to
896bc19
Compare
|
@rdblue @szehon-ho @flyrain please review when you have time 🙏🏻 |
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey , really sorry for delay, let's work on it this week. Some early comments
|
|
||
| private final Geometry geometry; | ||
|
|
||
| public Geography(Geometry geometry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need interop between Geography and Geometry? I assume we have just have a class for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define Geography this way to reuse the data structures of coordinates and geospatial objects (Points, LineString, etc.), as well as WKB/WKT functions provided by JTS. Geography does not interoperate with Geometry, they are different classes but have the same data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a base class, and two implementing classes then? its confusing as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try removing Geometry and Geography from iceberg-api, and use ByteBuffer as the underlying Java class for these types.
| return String.valueOf(value).startsWith((String) literal.value()); | ||
| case NOT_STARTS_WITH: | ||
| return !String.valueOf(value).startsWith((String) literal.value()); | ||
| case ST_INTERSECTS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets split out predicate pruning support. I feel 80% of the changes is to support those, let's focus in this pr to get the API right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Predicates and pruning were removed from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
896bc19 to
4d1bb83
Compare
build.gradle
Outdated
|
|
||
| dependencies { | ||
| implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow') | ||
| api libs.jts.core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not comfortable leaking the JTS API from Iceberg API, as this forces all the consumer to now depend on JTS. Can we instead encapsulate this dep in iceberg-core? Going to check with other Iceberg PMC's on this.
At very least this should be implementation.
| if (primitive.typeId() == Type.TypeID.GEOMETRY) { | ||
| Types.GeometryType geometryType = (Types.GeometryType) primitive; | ||
| generator.writeStartObject(); | ||
| generator.writeStringField("type", "geometry"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: use the constant for "type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced constant literals such as "type", "geometry", "geography", "crs" and "algorithm" with constants.
|
|
||
| static void toJson(Type.PrimitiveType primitive, JsonGenerator generator) throws IOException { | ||
| generator.writeString(primitive.toString()); | ||
| if (primitive.typeId() == Type.TypeID.GEOMETRY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: use switch statement like rest of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to use switch-case.
| generator.writeStringField("type", "geometry"); | ||
| String crs = geometryType.crs(); | ||
| if (!crs.isEmpty()) { | ||
| generator.writeStringField("crs", geometryType.crs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "crs" seems used frequently enough to warrant a constant.
| return TYPES.get(lowerTypeString); | ||
| } | ||
|
|
||
| if (lowerTypeString.startsWith("geometry")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cant we put the startsWith in the regex, like DECIMAL and FIXED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot match the regex with lowerTypeString, because we want to extract the original CRS from the type instead of a lower case one. I'll see how to make it more concise and get rid of startsWith.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we consider CRS to be case sensitive?
The capture group of a case insensitive regular expression returns the matched characters without modification. So if CRS is case sensitive, you can still use a regular expression, like this:
Pattern pattern = Pattern.compile("geometry\\(([^)]*)\\)", Pattern.CASE_INSENSITIVE);This test will pass:
Matcher m = pattern.matcher("GEOMETRY(aBc)");
assertThat(m.matches()).isTrue();
assertThat(m.group(1)).isEqualTo("aBc");I'd prefer using a case insensitive pattern on typeString like this, rather than startsWith and substring with a hard-coded length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using case insensitive regex pattern matching.
| return new Literals.GeometryLiteral(value); | ||
| } | ||
|
|
||
| static Literal<Geography> of(Geography value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess here, let's specifically not leak the Geography in the API, we should just have a wrapper class as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove GeometryLiteral ad GeographyLiteral. The iceberg expressions should only work with bounding boxes, so there's no need to involve full-fledged Geometry and Geography objects.
|
|
||
| private final Geometry geometry; | ||
|
|
||
| public Geography(Geometry geometry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a base class, and two implementing classes then? its confusing as it is.
| } | ||
|
|
||
| public static GeographyType of(String crs) { | ||
| return of(crs, DEFAULT_ALGORITHM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another option is to allow null for algorithm right? and the code can default it , as per the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. The algorithm field is null when no algorithm is specified.
4d1bb83 to
73f6022
Compare
…ry and geography types to ByteBuffer
|
I have removed dependency to JTS from iceberg-api, now the underlying Java type of Geometry and Geography are I have removed code for converting between ByteBuffer and geospatial types, removed GeometryLiteral/GeographyLiteral, as well as default initial/write value support for geospatial fields. I'll proceed adding them once we are all agree on the API of primitive geospatial types. |
| class Identity<T> implements Transform<T, T> { | ||
| private static final Identity<?> INSTANCE = new Identity<>(); | ||
|
|
||
| private static final Set<Type.TypeID> UNSUPPORTED_TYPES = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| } | ||
|
|
||
| public static GeometryType of(String crs) { | ||
| return new GeometryType(crs == null ? "" : crs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Maybe crs == null can be crs == null || crs.isEmpty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believed that including crs.isEmpty() was redundant in this situation, so I didn't add it, as we would return an empty string regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The empty string should not be allowed. That is an invalid CRS because it has no information. Null is fine to support here if you expect to have situations in which callers will find it convenient to pass null to mean "use the default".
| // We need to set outputDimension = 4 for XYM geometries to make JTS WKTWriter or WKBWriter work | ||
| // correctly. | ||
| // The WKB/WKT writers will ignore Z ordinate for XYM geometries. | ||
| if (!Double.isNaN(coordinate.getZ())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could do (not a must).
if (Coordinate.NULL_ORDINATE != coordinate.getZ())There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that we cannot do this. Coordinate.NULL_ORDINATE is NaN, comparing it with other floating point number (including NaN) using != will always yield true.
| import org.locationtech.jts.geom.Geometry; | ||
| import org.locationtech.jts.geom.GeometryFactory; | ||
|
|
||
| public class TestGeometryUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you covered Coordinate and its subclasses except ExtendedCoordinate, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. As far as I know, ExtendedCoordinate is part of JTS example code so we don't need to handle it. JTS will only be used internally in iceberg and it exchanges geospatial data with other systems through WKB/Bounds byte buffer, so there's no need to take non-standard extensions of Coordinate into consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation.
hsiang-c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Kontinuation for removing the api dependency on jts from iceberg-api, this looks a lot closer to me
| } | ||
|
|
||
| public static EdgeInterpolationAlgorithm fromName(String algorithmName) { | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Precondition, check null and throw exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added precondition with error message: Invalid edge interpolation algorithm: null.
| return EdgeInterpolationAlgorithm.valueOf(algorithmName.toUpperCase(Locale.ENGLISH)); | ||
| } catch (IllegalArgumentException e) { | ||
| throw new IllegalArgumentException( | ||
| String.format("Invalid edge interpolation algorithm name: %s", algorithmName), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can remove redundant 'name'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Now the error message is Invalid edge interpolation algorithm: %s
| } | ||
|
|
||
| public static GeometryType get() { | ||
| return of(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not , new GeometryType("")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Now we are using null as default value as requested. For now it is new GeometryType(null).
| } | ||
|
|
||
| public static GeographyType get() { | ||
| return of(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not new GeographyType("")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Now it is new GeographyType(null, null).
| .isThrownBy(() -> required("field").ofType(Types.StringType.get()).build()) | ||
| .withMessage("Id cannot be null"); | ||
|
|
||
| assertThat(Types.fromPrimitiveString("geometry")).isEqualTo(Types.GeometryType.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in its own test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this suite, these cases would be added to fromPrimitiveString and fromTypeName.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved tests to fromTypeName and fromPrimitiveString.
| } | ||
|
|
||
| private static Types.GeometryType geometryFromJson(JsonNode json) { | ||
| String crs = JsonUtil.getStringOrNull("crs", json); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these (and following) can be replaced by the constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reverted since we are not encoding geospatial types as JSON.
| private final String crs; | ||
|
|
||
| private GeometryType(String crs) { | ||
| this.crs = crs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Precondition that crs is not null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to allow null to indicate the default. That's easier to detect for toString to show geometry instead of geometry().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allow null and not allow empty string for CRS. Additionally we check that the CRS should not have leading or trailing spaces to make sure that the value of CRS will be persisted after converting to string and parsing it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, why not always have toString show the default value like Geography? Geometry("OGC:CRS84"). Is it to save some bytes ? Seems to be easier imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Geography does not always show the default CRS. It happens when the edge algorithm is not using default value but CRS is left unset. We have to fill in the default CRS as the first argument in this case.
I thought that the behavior of toString is to omit optional parameters when possible. See https://github.com/apache/iceberg/pull/12346/files/0c451cac53f4098680aa0ae518af70dc45efaeb1#r1994430054.
| } | ||
|
|
||
| private static int getOutputDimension(Geometry geom) { | ||
| int dimension = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: make a constant to avoid instantiating it every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is removed as we don't need WKB/WKT serialization according to the latest spec.
|
Also, (as can't comment on files that are not in the change) Do we need to add Geo types to following places?
|
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java
Outdated
Show resolved
Hide resolved
| case FIXED: | ||
| case BINARY: | ||
| case GEOMETRY: | ||
| case GEOGRAPHY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this included? I don't think that we want to translate these into Hive types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted this change. Does it mean that Hive catalog won't support geospatial columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this is needed, we can add it. But there's no value in translating to binary right now.
| Matcher geometry = GEOMETRY_PARAMETERS.matcher(typeString); | ||
| if (geometry.matches()) { | ||
| String crs = geometry.group(1); | ||
| Preconditions.checkArgument(!crs.contains(","), "Invalid CRS: %s", crs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this invalid? This should pass the string as-is. It is up to others to determine what is a valid CRS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comma in CRS will disrupt the parsing of geography types such as geography(crs,with,comma, spherical), so I treat crs containing comma as invalid to avoid ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this check and pass the CRS string as-is, since it won't cause ambiguity for geometry types.
| } | ||
|
|
||
| public String crs() { | ||
| return crs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If CRS is null, should this return DEFAULT_CRS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to return DEFAULT_CRS, so that the callers don't have to handle both nulls and default CRS.
We can make GeographyType.algorithm return EdgeAlgorithm.SPHERICAL when algorithm is null as well.
|
Looks great! There are some minor nits, but I think it makes sense to commit this to move forward. Thanks @Kontinuation and @szehon-ho! |
|
Thanks , great work @Kontinuation , and thanks @rdblue ! |
This adds 2 primitive types to
iceberg-apiandiceberg-corefor supporting geospatial data types, partially implementing the iceberg geo spec: #10981The newly added primitive types are:
geometry(C): geometries with linear/planar edge interpolationgeography(C, A): geometries with non-linear edge interpolation, the algorithm for interpolating edges is parameterized byA.