Conversation

@TheNeuralBit TheNeuralBit commented May 21, 2024

What changes are included in this PR?

This modifies parquet::arrow::FileWriter to always copy the Arrow Table's schema-level metadata into the produced Parquet file. Currently, FileWriter only does this when ArrowWriterProperties::store_schema is true.
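The behavior change can be modeled with a small Python sketch. This is an illustrative toy model, not the actual C++ code; the function names are hypothetical:

```python
# Toy model of how schema-level key-value metadata reaches the Parquet
# footer, before and after this PR. Function names are hypothetical.

def footer_metadata_before(schema_metadata, store_schema):
    """Old behavior: metadata is copied only when store_schema is true."""
    if store_schema:
        # The real writer also adds the serialized ARROW:schema key here.
        return dict(schema_metadata)
    return {}

def footer_metadata_after(schema_metadata, store_schema):
    """Behavior with this PR: metadata is always copied."""
    return dict(schema_metadata)

meta = {"origin": "sensor-a"}
assert footer_metadata_before(meta, store_schema=False) == {}
assert footer_metadata_after(meta, store_schema=False) == meta
```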

Are these changes tested?

Yes

Are there any user-facing changes?

This slightly changes the behavior of parquet::arrow::FileWriter; I'm not sure whether this would be considered a user-facing change.

@TheNeuralBit TheNeuralBit requested a review from wgtmac as a code owner May 21, 2024 22:11
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

@TheNeuralBit TheNeuralBit changed the title Copy key value metadata when store_schema is false GH-41766: [Parquet] Copy key value metadata when store_schema is false May 21, 2024
@github-actions

⚠️ GitHub issue #41766 has been automatically assigned in GitHub to PR creator.

@wgtmac wgtmac changed the title GH-41766: [Parquet] Copy key value metadata when store_schema is false GH-41766: [C++][Parquet] Copy key value metadata when store_schema is false May 22, 2024

@wgtmac wgtmac left a comment

+1

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label May 22, 2024

mapleFU commented May 22, 2024

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata. @wgtmac, do you know the context here?

wgtmac commented May 22, 2024

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata. @wgtmac, do you know the context here?

I found this from #5077:

Add ArrowWriterProperties::store_schema() option which stores the Arrow schema used to create a Parquet file in a special ARROW:schema key in the metadata, so that we can detect that a column was originally DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future
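As a hedged illustration of what store_schema adds: the Arrow schema is serialized (to its IPC representation) and stored base64-encoded under the special ARROW:schema key. The bytes below are a stand-in, not a real serialized schema:

```python
import base64

# Stand-in bytes; a real writer would use the IPC-serialized Arrow schema.
serialized_schema = b"\x10\x00\x00\x00schema-ipc-bytes"

# The footer gets one special key holding the base64-encoded schema.
footer_kv = {"ARROW:schema": base64.b64encode(serialized_schema).decode("ascii")}

# A reader that recognizes the key can recover the original bytes:
assert base64.b64decode(footer_kv["ARROW:schema"]) == serialized_schema
```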

mapleFU commented May 22, 2024

I found this from #5077

But that doesn't mean other metadata in the schema should be written to the file, does it?

@TheNeuralBit would you mind running clang-format on cpp/src/parquet/arrow/arrow_reader_writer_test.cc to fix the lint?

@TheNeuralBit

would you mind running clang-format

Done.

@TheNeuralBit

This fix looks nice to me, but I'm a bit confused about why schema.metadata() is copied into the file-level metadata

I agree this is a little surprising. Is there some other way to use FileWriter to write key-value metadata to the parquet file metadata? This was the first method I found.

Regardless I think we should be consistent between the store_schema true and false paths. Right now one copies the schema metadata to parquet file-level metadata and one does not. An alternative could be to change the store_schema true path (and document an alternative way to write file-level metadata) but that's more likely to be a breaking change I think.

mapleFU commented May 22, 2024

Is there some other way to use FileWriter to write key-value metadata to the parquet file metadata? This was the first method I found.

There is a `void AddKeyValueMetadata(...)` method on the low-level writer. I'm trying to add it to the Arrow wrapper (#41633), but I'm quite busy this week; maybe I'll finish it tomorrow?

Regardless I think we should be consistent between the store_schema true and false paths.

Yeah, I think this edit is great, but I'm a bit confused about why it takes the key-value metadata from the schema. @pitrou, may I ask the reason here?

mapleFU commented May 22, 2024

--- a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
+++ b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -4054,7 +4054,8 @@ TEST(TestArrowReaderAdHoc, OldDataPageV2) {
     GTEST_SKIP() << "ARROW_TEST_DATA not set.";
   }
   std::stringstream ss;
-  ss << c_root << "/" << "parquet/ARROW-17100.parquet";
+  ss << c_root << "/"
+     << "parquet/ARROW-17100.parquet";
   std::string path = ss.str();
   TryReadDataFile(path);
 }

Aha, one more file that was missed by clang-format.

pitrou commented May 22, 2024

@jorisvandenbossche What do you think about this proposed change?

pitrou commented May 22, 2024

It's also weird that, when store_schema is true, we copy the Arrow metadata directly into the Parquet metadata, but AFAIU we also serialize it as part of the Arrow schema. So it will end up essentially duplicated, wasting some storage and processing power (though probably not much in most cases).

mapleFU commented May 22, 2024

So it will end up essentially duplicated, wasting some storage and processing power (though probably not much in most cases).

Yeah, I think the previous behavior is a bit weird.

I'll try to add an API in the Arrow Parquet wrapper and expose it in Python tomorrow.

@jorisvandenbossche

On the issue I mentioned #31723, which has some prior discussion about this.

I agree with the general idea here that writing the Arrow schema metadata to the Parquet FileMetaData key_value_metadata should not depend on the store_schema flag. If we think we should map our schema metadata to Parquet in general, that should be done independently of writing an ARROW:schema key in the Parquet metadata.

It's also weird that, when store_schema is true, we copy the Arrow metadata directly into the Parquet metadata, but AFAIU we also serialize it as part of the Arrow schema. So it will end up essentially duplicated,

That is indeed an issue, and it is actually causing problems on the read side (which is what #31723 is originally about): on the read side we ignore all keys except ARROW:schema and use only the metadata included in that serialized schema.

The main problem is that just dropping the custom metadata from ARROW:schema would cause compatibility issues (reading a file with Arrow 16 would then not read any metadata).
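The read-side behavior being discussed can be sketched as a toy Python model (hypothetical names, not the real reader code): when an ARROW:schema entry is present, only the metadata embedded in that serialized schema is restored, and any other footer keys are dropped.

```python
# Toy model of the read-side precedence described above.
def restored_metadata(footer_kv, embedded_metadata):
    """footer_kv: Parquet footer key-value pairs.
    embedded_metadata: metadata recovered from the serialized ARROW:schema."""
    if "ARROW:schema" in footer_kv:
        return dict(embedded_metadata)  # other footer entries are ignored
    return dict(footer_kv)

footer = {"ARROW:schema": "<base64 placeholder>", "extra": "written directly"}
# The directly-written "extra" key is lost on read:
assert restored_metadata(footer, {"origin": "a"}) == {"origin": "a"}
```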

pitrou commented May 23, 2024

The main problem is that just dropping the custom metadata from ARROW:schema would cause issues with compatibility (reading a file with arrow 16 would then not read any metadata)

Ok, so we should probably fix the read side first, and then wait for a couple years before we fix the duplicated metadata issue?

@TheNeuralBit

Thanks very much for the context @jorisvandenbossche and @pitrou. To be clear, de-duping metadata when store_schema is set is the write-side change that needs to wait for a corresponding read-side change to have sufficient distribution. How should we handle this particular change (copying schema-level metadata to Parquet file-level metadata independent of the store_schema flag)?

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

  • unset: use value of store_schema
  • false: never copy schema metadata
  • true: always copy schema metadata
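The proposed tri-state could be modeled like this (the property name is hypothetical; this is a sketch of the resolution logic, not the real ArrowWriterProperties code):

```python
from typing import Optional

def should_copy_schema_metadata(copy_metadata: Optional[bool],
                                store_schema: bool) -> bool:
    """Resolve the proposed tri-state flag against store_schema."""
    if copy_metadata is None:      # unset: fall back to store_schema
        return store_schema
    return copy_metadata           # explicit true/false wins

assert should_copy_schema_metadata(None, store_schema=True) is True
assert should_copy_schema_metadata(None, store_schema=False) is False
assert should_copy_schema_metadata(False, store_schema=True) is False
assert should_copy_schema_metadata(True, store_schema=False) is True
```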

mapleFU commented Jun 2, 2024

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

  • unset: use value of store_schema
  • false: never copy schema metadata
  • true: always copy schema metadata

Personally I'm +1 on this

pitrou commented Jun 3, 2024

If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:

* unset: use value of store_schema
* false: never copy schema metadata
* true: always copy schema metadata

A tri-state value seems complicated and overkill IMHO; a simple boolean flag would suffice.

jorisvandenbossche commented Jun 6, 2024

Do we need a flag to control storing schema metadata in the Parquet FileMetaData? We could also simply always do that (and only have store_schema control whether to additionally write an ARROW:schema key), i.e. what the current state of the PR does, I think.

If someone does not want to write metadata, they can always remove the metadata from the schema before writing it to Parquet, so it is possible to control this that way (a bit more verbose, but at least on the Python side we have a one-liner helper method for removing metadata).

@github-actions github-actions bot added the Status: stale-warning label Nov 18, 2025
@thisisnic

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.
