
[C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata #31723

@asfimport

Description


Context: I ran into this issue when reading Parquet files created by GDAL (using the Arrow C++ APIs, OSGeo/gdal#5477), which writes files with custom key_value_metadata but without storing "ARROW:schema" in that metadata (cc @paleolimbot).

For both reading and writing files, I expected that we would map Arrow Schema::metadata to Parquet FileMetaData::key_value_metadata. But apparently this doesn't (always) happen out of the box; it only happens through the "ARROW:schema" key (which stores the serialized original Arrow schema, and thus also the metadata stored in that schema).

For example, when writing a Table with schema metadata, that metadata is not stored directly in the Parquet FileMetaData (the code below uses the branch from ARROW-16337 to get the store_schema keyword):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back preserves the metadata only if we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'
# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64

It seems that if we store the ARROW:schema key, we also store the schema metadata as separate key/value entries. But if store_schema is False, we also stop writing that metadata (I'm not fully sure whether this is the intended behaviour, and whether it is the reason for the output above):

# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
 b'key': b'value'}
# when not storing the schema, we also don't store the key:value metadata
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
True
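
As a possible workaround on the writing side, one could write the file without ARROW:schema and attach the custom keys explicitly (a minimal sketch, assuming a pyarrow version where ParquetWriter accepts store_schema and has an add_key_value_metadata method; neither is available in a released pyarrow at the time of writing):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]})

# write without the ARROW:schema key, then add the custom key explicitly
# (add_key_value_metadata must be called before the writer is closed)
with pq.ParquetWriter("test_metadata_custom_key.parquet", table.schema,
                      store_schema=False) as writer:
    writer.write_table(table)
    writer.add_key_value_metadata({"key": "value"})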

On the reading side, it seems that we do generally map custom key/value metadata into the schema metadata. Given the above, we don't currently have the pyarrow APIs to create such a file, but with a small patch I could create one:

# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
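
A quick way to confirm this mapping on the reading side (a small check against the file created above):

import pyarrow.parquet as pq

path = "test_metadata_without_arrow_schema2.parquet"
file_meta = pq.read_metadata(path).metadata    # raw Parquet key_value_metadata
schema_meta = pq.read_schema(path).metadata    # metadata of the reconstructed Arrow schema

# with no ARROW:schema key present, the custom keys round-trip 1:1
assert schema_meta == file_meta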

But if a file has both custom key/value metadata and an "ARROW:schema" key, we actually ignore the custom keys and only look at the "ARROW:schema" one.
This is the case I ran into with GDAL: the file has both keys, but the custom "geo" key is not also included in the serialized Arrow schema stored under the "ARROW:schema" key:

# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/////3gBAAAQ...'}
# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
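
Until the reader merges both sources, the custom keys can be re-attached manually after reading (a sketch using only public pyarrow APIs; read_table_with_file_metadata is a hypothetical helper name):

import pyarrow.parquet as pq

def read_table_with_file_metadata(path):
    # merge custom key_value_metadata into the Arrow schema metadata,
    # which the reader currently drops when ARROW:schema is present
    table = pq.read_table(path)
    file_meta = pq.read_metadata(path).metadata or {}
    custom = {k: v for k, v in file_meta.items() if k != b"ARROW:schema"}
    if custom:
        merged = {**(table.schema.metadata or {}), **custom}
        table = table.replace_schema_metadata(merged)
    return table

With the GDAL file above, this would keep the "geo" key in the resulting schema metadata.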

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-16339. Please see the migration documentation for further details.
