-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-41298: [Format][Docs] Add a canonical extension type specification for UUID #41299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't FixedSizeBinary a logical type? Is it specified somewhere what physical layout should be used? Unless I'm mistaken there are two options: fixed-size primitive layout (FixedSizePrimitive<[u8; 16]>) or fixed-size list layout (FixedSizeList<u8>(16))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed FixedSizeBinary maps to a single physical type (fixed size primitive). If this is ambiguous we should indeed specify (preferrably one) physical type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK we want to use a logical type here, since that are the concrete types that are actually used in the various specs like IPC and C Data Interface (the phsyical layouts are just described in the format docs, but it are the logical types that are listed in the flatbuffer Schema.fbs).
While "storage type" might sound like it points to a physical layout, the docs also mention that extension types annotate any built-in logical type: https://arrow.apache.org/docs/dev/format/Columnar.html#extension-types
Sidenote: there is not really such a thing as a "physical type", but only physical layout. Next to the various layouts, we just have "types" on top of that, which are sometimes called "logical types", but we are not very consistent about that terminology ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems that specifying the version should be optional metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUID versioning seems like a concern for the generator and consumers. It could become very difficult for Arrow to guarantee anything about versions without validating the values one by one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should specify what byte (0 or 15) is most-significant byte.
For instance, Java treats the 0-th byte as the MSB [1] which is consistent with the convention that big-endian the network byte order [2]:
private UUID(byte[] data) {
long msb = 0;
long lsb = 0;
assert data.length == 16 : "data must be 16 bytes in length";
for (int i=0; i<8; i++)
msb = (msb << 8) | (data[i] & 0xff);
for (int i=8; i<16; i++)
lsb = (lsb << 8) | (data[i] & 0xff);
this.mostSigBits = msb;
this.leastSigBits = lsb;
}When porting this code to C/C++, be careful with the fact that Java's integer types are big-endian (inheritance from SPARC), unlike most architectures we use today that are little-endian.
An advantage of putting the MSB at byte 0 is that when you parse an UUID string you read the string from the MSB to the LSB and write the UUID data from 0 to 15.
[1] https://github.com/openjdk/jdk/blob/819f3d6fc70ff6fe54ac5f9033c17c3dd4326aa5/src/java.base/share/classes/java/util/UUID.java#L116-L126
[2] https://stackoverflow.com/questions/13514614/why-is-network-byte-order-defined-to-be-big-endian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I've added a comment (0eaa1b9) and reported to the vote thread.
felipecrv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Merge depends on mailing list votes.
|
@emkornfield do you have any concerns left or shall we proceed and merge? |
|
If there's no further comments I'll merge this today. |
Co-authored-by: Antoine Pitrou <[email protected]>
|
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit c5be027. None of the specified runs were found on the Conbench server. The full Conbench report has more details. |
…cation for UUID (apache#41299) ### Rationale for this change Several users have expressed a need for a UUID type. This is to provide a canonical UUID extension type specification. ### What changes are included in this PR? This adds to documentation of canonical extension types. ### Are these changes tested? No as only docs are changed. ### Are there any user-facing changes? No. * GitHub Issue: apache#41298 Lead-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Rok Mihevc <[email protected]>
### Rationale for this change UUID extension types were made canonical in #41299 and are getting native support in C++ and Python in #37298. As such, it makes sense to provide an alternative user-defined extension type as an example that is unlikely to become a canonical extension type anytime soon. After discussion in #43809, we determined a `RationalType` would make sense. Please note that this is a redo of #43849 as I made a blunder and accidentally pushed a branch that was in a wonky state. ### What changes are included in this PR? A change in several doc locations which reference a `UuidType` extension type have been changed to a `RationalType`. For consistency, this PR also changes single quotes (`''`) to double quotes (`""`) throughout the Python examples that it modifies. Also, seemingly unrelated to this change, some doctests began failing as numpy changed the `repr` of `float16`'s between 1.x and 2.x. We have updated the failing doctest so that it supports both styles. ### Are these changes tested? These are documentation changes and `archery docker run conda-python-docs` succeeds locally. ### Are there any user-facing changes? No. cc @ianmcook @rok * GitHub Issue: #43809 --------- Co-authored-by: Ian Cook <[email protected]>
Rationale for this change
Several users have expressed a need for a UUID type. This is to provide a canonical UUID extension type specification.
What changes are included in this PR?
This adds to documentation of canonical extension types.
Are these changes tested?
No as only docs are changed.
Are there any user-facing changes?
No.