GH-41766: [C++][Parquet] Copy key value metadata when store_schema is false #41769
base: main
wgtmac
left a comment
+1
This fix looks good to me, but I'm a bit confused why
I found this from #5077:
But this doesn't mean that other metadata in the "schema" should be written to the file? @TheNeuralBit would you mind running clang-format for
Done.
I agree this is a little surprising. Is there some other way to use
Regardless, I think we should be consistent between the store_schema true and false paths. Right now one copies the schema metadata to Parquet file-level metadata and one does not. An alternative could be to change the store_schema true path (and document an alternative way to write file-level metadata), but I think that is more likely to be a breaking change.
arrow/cpp/src/parquet/file_writer.h Line 207 in 8967ddc
Yeah, I think this edit is great, but I'm a bit confused about why it uses the key-value metadata from the schema. @pitrou may I ask the reason here?
Aha, more files still need to be formatted.
@jorisvandenbossche What do you think about this proposed change?
It's also weird that, when
Yeah, I think the previous behavior is a bit weird. I'll try to add an API in arrow-parquet and expose it in Python tomorrow.
On the issue, I mentioned #31723, which has some prior discussion about this. I agree with the general idea here that the writing of the Arrow schema metadata to the Parquet FileMetaData
That is indeed an issue, and it is actually causing problems on the read side (which is what #31723 is originally about), because on the read side we will then ignore any keys other than ARROW:schema and the metadata included in that serialized schema. The main problem is that just dropping the custom metadata from ARROW:schema would cause compatibility issues (reading a file with Arrow 16 would then not read any metadata).
Ok, so we should probably fix the read side first, and then wait a couple of years before we fix the duplicated metadata issue?
Thanks very much for the context @jorisvandenbossche and @pitrou. To be clear, de-duping metadata when store_schema is set is the write-side change that needs to wait for a corresponding read side change to have sufficient distribution. How should we handle this particular change (copying schema-level metadata to parquet file-level metadata independent of store_schema flag)? If there's concern over opting everyone in to this I could add another flag in ArrowWriterProperties, as suggested in #31723. It could be a tri-state to maintain backward compatibility:
Personally I'm +1 on this
A tri-state value seems complicated and overkill IMHO, while a simple boolean flag would suffice.
Do we need a flag to control storing schema metadata in Parquet FileMetadata? We could also simply always do that (and only have
If someone does not want to write metadata, they can always remove the metadata from the schema before writing it to Parquet, so it is possible to control this that way (a bit more verbose, but at least on the Python side we have a helper method for removing metadata with a one-liner).
Thank you for your contribution. Unfortunately, this
What changes are included in this PR?
This modifies parquet::arrow::FileWriter to always copy the arrow Table's schema-level metadata into the produced parquet file. Currently FileWriter only does this if ArrowWriterProperties::store_schema is true.
Are these changes tested?
Yes
Are there any user-facing changes?
This slightly changes the behavior of parquet::arrow::FileWriter, I'm not sure if this would be considered a user-facing change or not.
parquet::arrow::FileWriter does not propagate schema-level metadata when ArrowWriterProperties::store_schema is false #41766
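To make the behavior change described in this PR concrete, here is a minimal standard-library sketch of the two write paths. It is a toy model, not Arrow's code: the KeyValueMetadata alias, the functions FileMetadataBefore and FileMetadataAfter, and the placeholder value stored under ARROW:schema are all hypothetical names introduced for illustration. The only facts taken from the discussion are that the store_schema true path already copies schema metadata and adds an ARROW:schema key, and that the false path previously copied nothing.

```cpp
#include <map>
#include <string>

// Toy stand-in for key-value metadata; NOT the real arrow::KeyValueMetadata.
using KeyValueMetadata = std::map<std::string, std::string>;

// Hypothetical model of the file-level metadata produced BEFORE this PR:
// schema metadata reaches the file only when store_schema is true.
KeyValueMetadata FileMetadataBefore(const KeyValueMetadata& schema_metadata,
                                    bool store_schema) {
  KeyValueMetadata out;
  if (store_schema) {
    out = schema_metadata;
    // Placeholder for the serialized Arrow schema (assumed representation).
    out["ARROW:schema"] = "<serialized schema placeholder>";
  }
  return out;
}

// Hypothetical model of the file-level metadata produced AFTER this PR:
// schema metadata is always copied; store_schema only controls whether
// the ARROW:schema entry is added.
KeyValueMetadata FileMetadataAfter(const KeyValueMetadata& schema_metadata,
                                   bool store_schema) {
  KeyValueMetadata out = schema_metadata;
  if (store_schema) {
    out["ARROW:schema"] = "<serialized schema placeholder>";
  }
  return out;
}
```

In this model the two functions agree whenever store_schema is true and differ only on the false path, which is the consistency argument made in the review comments.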