Writing UUID using PyArrow does not set the UUID logical type on Parquet #46469

@Fokko

Describe the bug, including details regarding any error messages, version, and platform.

Consider the following example, which writes a UUID column with PyArrow:

In [1]: import pyarrow as pa
   ...: import uuid
   ...: 
   ...: schema = pa.schema(
   ...:    [
   ...:        pa.field("uuid", pa.uuid(), nullable=False),
   ...:    ]
   ...: )
   ...: 
   ...: arr_table = pa.Table.from_pydict(
   ...:     {
   ...:         "uuid": [
   ...:             uuid.UUID("00000000-0000-0000-0000-000000000000").bytes,
   ...:             uuid.UUID("11111111-1111-1111-1111-111111111111").bytes,
   ...:         ],
   ...:     },
   ...:     schema=schema,
   ...: )
   ...: 
   ...: import pyarrow.parquet as pq
   ...: 
   ...: with pq.ParquetWriter("/tmp/some-parquet-with-uuid.parquet", schema=schema) as writer:
   ...:    writer.write(arr_table)
   ...: 
> parq /tmp/some-parquet-with-uuid.parquet -s

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x105cd0480>
required group field_id=-1 schema {
  required fixed_len_byte_array(16) field_id=-1 uuid;
}

For comparison, here is a file that was created by Iceberg (Java):

> parq /var/folders/h0/wqtwn1ks0m3bksc8n7mp4_lr0000gp/T/hive12405183144450566152/table/data/uuid_bucket=7/00000-1-daaf21e3-fbee-4954-88d8-ea0371f62a6a-0-00001.parquet -s

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x1058ce840>
required group field_id=-1 table {
  required fixed_len_byte_array(16) field_id=1 uuid (UUID);
}

The (UUID) suffix is the Parquet logical type annotation. It is missing from the file written by PyArrow, and many readers rely on it to recognize the column as a UUID.

Component(s)

Parquet
