Conversation

@TheNeuralBit TheNeuralBit commented May 21, 2024

What changes are included in this PR?

This modifies parquet::arrow::FileWriter to always copy the Arrow Table's schema-level metadata into the produced Parquet file. Currently, FileWriter only does this when ArrowWriterProperties::store_schema is true.
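The behavior change can be modeled with a small Python sketch. This is an illustrative toy model, not the actual C++ code; the function names are hypothetical:

```python
# Toy model of how schema-level key-value metadata reaches the Parquet
# footer, before and after this PR. Function names are hypothetical.

def footer_metadata_before(schema_metadata, store_schema):
    """Old behavior: metadata is copied only when store_schema is true."""
    if store_schema:
        # The real writer also adds the serialized ARROW:schema key here.
        return dict(schema_metadata)
    return {}

def footer_metadata_after(schema_metadata, store_schema):
    """Behavior with this PR: metadata is always copied."""
    return dict(schema_metadata)

meta = {"origin": "sensor-a"}
assert footer_metadata_before(meta, store_schema=False) == {}
assert footer_metadata_after(meta, store_schema=False) == meta
```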

Are these changes tested?

Yes

Are there any user-facing changes?

This slightly changes the behavior of parquet::arrow::FileWriter; I'm not sure whether this would be considered a user-facing change.

@TheNeuralBit TheNeuralBit requested a review from wgtmac as a code owner May 21, 2024 22:11
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

@TheNeuralBit TheNeuralBit changed the title Copy key value metadata when store_schema is false GH-41766: [Parquet] Copy key value metadata when store_schema is false May 21, 2024
@github-actions

⚠️ GitHub issue #41766 has been automatically assigned in GitHub to PR creator.

@wgtmac wgtmac changed the title GH-41766: [Parquet] Copy key value metadata when store_schema is false GH-41766: [C++][Parquet] Copy key value metadata when store_schema is false May 22, 2024

@wgtmac wgtmac left a comment

+1

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label May 22, 2024

mapleFU commented May 22, 2024

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata. @wgtmac, do you know the context here?

wgtmac commented May 22, 2024

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata. @wgtmac, do you know the context here?

I found this from #5077:

Add ArrowWriterProperties::store_schema() option which stores the Arrow schema used to create a Parquet file in a special ARROW:schema key in the metadata, so that we can detect that a column was originally DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future
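As a hedged illustration of what store_schema adds: the Arrow schema is serialized (to its IPC representation) and stored base64-encoded under the special ARROW:schema key. The bytes below are a stand-in, not a real serialized schema:

```python
import base64

# Stand-in bytes; a real writer would use the IPC-serialized Arrow schema.
serialized_schema = b"\x10\x00\x00\x00schema-ipc-bytes"

# The footer gets one special key holding the base64-encoded schema.
footer_kv = {"ARROW:schema": base64.b64encode(serialized_schema).decode("ascii")}

# A reader that recognizes the key can recover the original bytes:
assert base64.b64decode(footer_kv["ARROW:schema"]) == serialized_schema
```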

mapleFU commented May 22, 2024

I found this from #5077

But that doesn't mean other metadata in the schema should be written to the file, does it?

@TheNeuralBit would you mind running clang-format on cpp/src/parquet/arrow/arrow_reader_writer_test.cc to fix the lint?

@TheNeuralBit

would you mind running clang-format

Done.

@TheNeuralBit

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata

I agree this is a little surprising. Is there some other way to use FileWriter to write key-value metadata to the parquet file metadata? This was the first method I found.

Regardless I think we should be consistent between the store_schema true and false paths. Right now one copies the schema metadata to parquet file-level metadata and one does not. An alternative could be to change the store_schema true path (and document an alternative way to write file-level metadata) but that's more likely to be a breaking change I think.

mapleFU commented May 22, 2024

Is there some other way to use FileWriter to write key-value metadata to the parquet file metadata? This was the first method I found.

There is a `void AddKeyValueMetadata(...)` method on the low-level writer. I'm trying to add it to the Arrow wrapper (#41633), but I'm quite busy this week; maybe I'll finish it tomorrow?

Regardless I think we should be consistent between the store_schema true and false paths.

Yeah, I think this edit is great, but I'm a bit confused about why it takes the key-value metadata from the schema. @pitrou, may I ask the reason here?

mapleFU commented May 22, 2024

--- a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
+++ b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -4054,7 +4054,8 @@ TEST(TestArrowReaderAdHoc, OldDataPageV2) {
     GTEST_SKIP() << "ARROW_TEST_DATA not set.";
   }
   std::stringstream ss;
-  ss << c_root << "/" << "parquet/ARROW-17100.parquet";
+  ss << c_root << "/"
+     << "parquet/ARROW-17100.parquet";
   std::string path = ss.str();
   TryReadDataFile(path);
 }

Aha, one more file that was missed by clang-format.

pitrou commented May 22, 2024

@jorisvandenbossche What do you think about this proposed change?

pitrou commented May 22, 2024

It's also weird that, when store_schema is true, we copy the Arrow metadata directly into the Parquet metadata, but AFAIU we also serialize it as part of the Arrow schema. So it will end up essentially duplicated, wasting some storage and processing power (though probably not much in most cases).

mapleFU commented May 22, 2024

So it will end up essentially duplicated, wasting some storage and processing power (though probably not much in most cases).

Yeah, I think the previous behavior is a bit weird.

I'll try to add an API in the Arrow Parquet wrapper and expose it in Python tomorrow.

@jorisvandenbossche

On the issue I mentioned #31723, which has some prior discussion about this.

I agree with the general idea here that writing the Arrow schema metadata to the Parquet FileMetaData key_value_metadata should not depend on the store_schema flag. If we think we should map our schema metadata to Parquet in general, that should be done independently of writing an ARROW:schema key in the Parquet metadata.

It's also weird that, when store_schema is true, we copy the Arrow metadata directly into the Parquet metadata, but AFAIU we also serialize it as part of the Arrow schema. So it will end up essentially duplicated,

That is indeed an issue, and it is actually causing problems on the read side (which is what #31723 is originally about): on the read side we ignore all keys except ARROW:schema and use only the metadata included in that serialized schema.

The main problem is that just dropping the custom metadata from ARROW:schema would cause compatibility issues (reading a file with Arrow 16 would then not read any metadata).
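The read-side behavior being discussed can be sketched as a toy Python model (hypothetical names, not the real reader code): when an ARROW:schema entry is present, only the metadata embedded in that serialized schema is restored, and any other footer keys are dropped.

```python
# Toy model of the read-side precedence described above.
def restored_metadata(footer_kv, embedded_metadata):
    """footer_kv: Parquet footer key-value pairs.
    embedded_metadata: metadata recovered from the serialized ARROW:schema."""
    if "ARROW:schema" in footer_kv:
        return dict(embedded_metadata)  # other footer entries are ignored
    return dict(footer_kv)

footer = {"ARROW:schema": "<base64 placeholder>", "extra": "written directly"}
# The directly-written "extra" key is lost on read:
assert restored_metadata(footer, {"origin": "a"}) == {"origin": "a"}
```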

pitrou commented May 23, 2024

The main problem is that just dropping the custom metadata from ARROW:schema would cause issues with compatibility (reading a file with arrow 16 would then not read any metadata)

Ok, so we should probably fix the read side first, and then wait for a couple years before we fix the duplicated metadata issue?

@TheNeuralBit

Thanks very much for the context @jorisvandenbossche and @pitrou. To be clear, de-duping metadata when store_schema is set is the write-side change that needs to wait for a corresponding read-side change to have sufficient distribution. How should we handle this particular change (copying schema-level metadata to Parquet file-level metadata independent of the store_schema flag)?

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

  • unset: use value of store_schema
  • false: never copy schema metadata
  • true: always copy schema metadata
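The proposed tri-state could be modeled like this (the property name is hypothetical; this is a sketch of the resolution logic, not the real ArrowWriterProperties code):

```python
from typing import Optional

def should_copy_schema_metadata(copy_metadata: Optional[bool],
                                store_schema: bool) -> bool:
    """Resolve the proposed tri-state flag against store_schema."""
    if copy_metadata is None:      # unset: fall back to store_schema
        return store_schema
    return copy_metadata           # explicit true/false wins

assert should_copy_schema_metadata(None, store_schema=True) is True
assert should_copy_schema_metadata(None, store_schema=False) is False
assert should_copy_schema_metadata(False, store_schema=True) is False
assert should_copy_schema_metadata(True, store_schema=False) is True
```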

mapleFU commented Jun 2, 2024

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

  • unset: use value of store_schema
  • false: never copy schema metadata
  • true: always copy schema metadata

Personally I'm +1 on this

pitrou commented Jun 3, 2024

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

* unset: use value of store_schema
* false: never copy schema metadata
* true: always copy schema metadata

A tri-state value seems complicated and overkill IMHO; a simple boolean flag would suffice.

jorisvandenbossche commented Jun 6, 2024

Do we need a flag to control storing schema metadata in the Parquet FileMetaData? We could also simply always do that (and only have store_schema control whether to additionally write an ARROW:schema key), i.e. what the current state of the PR does, I think.

If someone does not want to write metadata, they can always remove the metadata from the schema before writing it to Parquet, so it is possible to control this that way (a bit more verbose, but at least on the Python side we have a one-liner helper method for removing metadata).

@github-actions github-actions bot added the Status: stale-warning label Nov 18, 2025
@thisisnic

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.
