Description
Context: I ran into this issue when reading Parquet files created by GDAL (using the Arrow C++ APIs, OSGeo/gdal#5477), which writes files that have custom key_value_metadata, but without storing ARROW:schema in those metadata (cc @paleolimbot)
—
Both in reading and writing files, I expected that we would map Arrow Schema::metadata with Parquet FileMetaData::key_value_metadata. But apparently this doesn't (always) happen out of the box, and only happens through the "ARROW:schema" field (which stores the original Arrow schema, and thus the metadata stored in this schema).
For example, when writing a Table with schema metadata, this is not stored directly in the Parquet FileMetaData (code below is using branch from ARROW-16337 to have the store_schema keyword):
```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)
```
```python
# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'

# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64
```

It seems that if we store the ARROW:schema, we also store the schema metadata separately. But if store_schema is False, we also stop writing those metadata (not fully sure whether this is the intended behaviour, and whether that explains the above output):
```python
# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
 b'key': b'value'}

# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
True
```

On the reading side, it seems that we generally do read custom key/value metadata into schema metadata. We don't have the pyarrow APIs at the moment to create such a file (given the above), but with a small patch I could create one:
```python
# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
```

But if a file has both custom key/value metadata and an "ARROW:schema" key, we actually ignore the custom keys and only look at the "ARROW:schema" one.
This is the case I ran into with GDAL: I have a file with both keys, but the custom "geo" key is not included in the serialized Arrow schema stored under the "ARROW:schema" key:
```python
# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/////3gBAAAQ...'}

# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
```

Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata) (is related to)
Note: This issue was originally created as ARROW-16339. Please see the migration documentation for further details.