Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Apr 18, 2024

Rationale for this change

Several users have expressed a need for a UUID type. This is to provide a canonical UUID extension type specification.

What changes are included in this PR?

This adds to documentation of canonical extension types.

Are these changes tested?

No as only docs are changed.

Are there any user-facing changes?

No.

@github-actions
Copy link

⚠️ GitHub issue #41298 has been automatically assigned in GitHub to PR creator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't FixedSizeBinary a logical type? Is it specified somewhere what physical layout should be used? Unless I'm mistaken there are two options: fixed-size primitive layout (FixedSizePrimitive<[u8; 16]>) or fixed-size list layout (FixedSizeList<u8>(16))?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed FixedSizeBinary maps to a single physical type (fixed size primitive). If this is ambiguous we should indeed specify (preferrably one) physical type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we want to use a logical type here, since that are the concrete types that are actually used in the various specs like IPC and C Data Interface (the phsyical layouts are just described in the format docs, but it are the logical types that are listed in the flatbuffer Schema.fbs).
While "storage type" might sound like it points to a physical layout, the docs also mention that extension types annotate any built-in logical type: https://arrow.apache.org/docs/dev/format/Columnar.html#extension-types

Sidenote: there is not really such a thing as a "physical type", but only physical layout. Next to the various layouts, we just have "types" on top of that, which are sometimes called "logical types", but we are not very consistent about that terminology ..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that specifying the version should be optional metadata?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID versioning seems like a concern for the generator and consumers. It could become very difficult for Arrow to guarantee anything about versions without validating the values one by one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should specify what byte (0 or 15) is most-significant byte.

For instance, Java treats the 0-th byte as the MSB [1] which is consistent with the convention that big-endian the network byte order [2]:

    private UUID(byte[] data) {
        long msb = 0;
        long lsb = 0;
        assert data.length == 16 : "data must be 16 bytes in length";
        for (int i=0; i<8; i++)
            msb = (msb << 8) | (data[i] & 0xff);
        for (int i=8; i<16; i++)
            lsb = (lsb << 8) | (data[i] & 0xff);
        this.mostSigBits = msb;
        this.leastSigBits = lsb;
    }

When porting this code to C/C++, be careful with the fact that Java's integer types are big-endian (inheritance from SPARC), unlike most architectures we use today that are little-endian.

An advantage of putting the MSB at byte 0 is that when you parse an UUID string you read the string from the MSB to the LSB and write the UUID data from 0 to 15.

[1] https://github.com/openjdk/jdk/blob/819f3d6fc70ff6fe54ac5f9033c17c3dd4326aa5/src/java.base/share/classes/java/util/UUID.java#L116-L126
[2] https://stackoverflow.com/questions/13514614/why-is-network-byte-order-defined-to-be-big-endian

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've added a comment (0eaa1b9) and reported to the vote thread.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 30, 2024
Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merge depends on mailing list votes.

@rok rok requested a review from emkornfield May 7, 2024 12:02
@rok
Copy link
Member Author

rok commented May 7, 2024

@emkornfield do you have any concerns left or shall we proceed and merge?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 7, 2024
@rok
Copy link
Member Author

rok commented May 8, 2024

If there's no further comments I'll merge this today.

@rok rok merged commit c5be027 into apache:main May 8, 2024
@rok rok deleted the uuid_spec branch May 8, 2024 22:46
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit c5be027.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…cation for UUID (apache#41299)

### Rationale for this change

Several users have expressed a need for a UUID type. This is to provide a canonical UUID extension type specification.

### What changes are included in this PR?

This adds to documentation of canonical extension types.

### Are these changes tested?

No as only docs are changed.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41298

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
ianmcook added a commit that referenced this pull request Sep 17, 2024
### Rationale for this change

UUID extension types were made canonical in #41299 and are getting
native support in C++ and Python in #37298. As such, it makes sense to
provide an alternative user-defined extension type as an example that is
unlikely to become a canonical extension type anytime soon.

After discussion in #43809, we determined a `RationalType` would make
sense.

Please note that this is a redo of #43849 as I made a blunder and
accidentally pushed a branch that was in a wonky state.

### What changes are included in this PR?

A change in several doc locations which reference a `UuidType` extension
type have been changed to a `RationalType`.

For consistency, this PR also changes single quotes (`''`) to double
quotes (`""`) throughout the Python examples that it modifies.

Also, seemingly unrelated to this change, some doctests began failing as
numpy changed the `repr` of `float16`'s between 1.x and 2.x. We have
updated the failing doctest so that it supports both styles.

### Are these changes tested?

These are documentation changes and `archery docker run
conda-python-docs` succeeds locally.

### Are there any user-facing changes?

No.

cc @ianmcook @rok 
* GitHub Issue: #43809

---------

Co-authored-by: Ian Cook <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants