GH-41608: [C++][Python] Extends the add_key_value to parquet::arrow and PyArrow #41633
| class ArrowReaderProperties; | ||
|
|
||
| class WriterProperties; | ||
| class WriterPropertiesBuilder; |
We don't actually have this class, lol
Sigh, maybe this is for the R bindings (see r/src/parquet.cpp); reverting it back.
034d5f8 to
b319fc8
Compare
b319fc8 to
48d9cfb
Compare
cpp/src/parquet/arrow/writer.h
Outdated
| /// | ||
| /// WARNING: If `store_schema` is enabled, `ARROW:schema` would be stored | ||
| /// in the key-value metadata. Overwriting this key would result in | ||
| /// `store_schema` unusable during read. |
| /// `store_schema` unusable during read. | |
| /// undefined behavior of `ARROW:schema` during read. |
It seems this term cannot be used here? And if the schema cannot be read, would it fall back to the "physical" schema?
Let's not use the loaded term "undefined behavior" here? I think the original wording is mostly fine, e.g.:
| /// `store_schema` unusable during read. | |
| /// `store_schema` being unusable during read. |
|
I've no idea why the R CI failed.
|
|
||
| auto kv_meta = std::make_shared<KeyValueMetadata>(); | ||
| kv_meta->Append("test_key_1", "test_value_1"); | ||
| kv_meta->Append("test_key_2", "test_value_2_"); |
The difference between test_value_2 and test_value_2_ is very obscure and easy to miss. Can you make this test clearer?
// <test_key_2, test_value_2_temp> would be overwritten later.
kv_meta->Append("test_key_2", "test_value_2_temp");
Changed to the code above.
| const auto& key_value_metadata = writer->metadata()->key_value_metadata(); | ||
| ASSERT_TRUE(nullptr != key_value_metadata); | ||
|
|
||
| // Verify keys that were added before file writer was closed are present. |
Can we instead read the file to make sure the metadata is here?
Sure, though I don't think they would differ. I'll verify both ways: by reading from writer->metadata()->key_value_metadata() and by reading the file back.
@jorisvandenbossche Could you perhaps review the Cython/Python parts of this PR? |
|
@AlenkaF @jorisvandenbossche @pitrou Would you mind taking a look?
AlenkaF
left a comment
Thank you for adding this!
The Python/Cython part LGTM 👍 with only one suggestion: I think the Python test fits better in tests/parquet/test_parquet_writer.py.
|
Migrate to |
pitrou
left a comment
LGTM, just two nits
python/pyarrow/_parquet.pxd
Outdated
| COutputStream, CCacheOptions, | ||
| TimeUnit, CRecordBatchReader) | ||
| from pyarrow.lib cimport _Weakrefable | ||
| from pyarrow.lib cimport (_Weakrefable, pyarrow_unwrap_metadata, KeyValueMetadata) |
Note that the new imports are only required in _parquet.pyx AFAICT.
python/pyarrow/parquet/core.py
Outdated
| Parameters | ||
| ---------- | ||
| key_value_metadata : {Key, Value} |
Let's reuse the same conventions as in other docstrings:
| key_value_metadata : {Key, Value} | |
| key_value_metadata : dict | |
| Keys and values must be string-like / coercible to bytes |
|
@pitrou Comments addressed.
|
@mapleFU Feel free to merge if CI is fine. |
|
The CI failure is unrelated; merging.
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d02a91b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
The previous PR (#34889) added `AddKeyValueMetadata` to FileWriter. This PR exposes it in the Parquet Arrow and Python APIs.
What changes are included in this PR?
- `AddKeyValueMetadata` in parquet::arrow
- `add_key_value_metadata` in pyarrow
Are these changes tested?
Yes
Are there any user-facing changes?
A new API that allows adding key-value metadata to a Parquet file.