Skip to content

Conversation

@mapleFU
Copy link
Member

@mapleFU mapleFU commented May 13, 2024

Rationale for this change

The previous pr ( #34889 ) add a AddKeyValueMetadata to FileWriter. And now we should export it to Parquet Arrow and Python API.

What changes are included in this PR?

  1. Add AddKeyValueMetadata in parquet::arrow
  2. Add add_key_value_metadata in pyarrow
  3. testing

Are these changes tested?

Yes

Are there any user-facing changes?

New api allowing add key-value metadata to Parquet file

@github-actions
Copy link

⚠️ GitHub issue #41608 has been automatically assigned in GitHub to PR creator.

class ArrowReaderProperties;

class WriterProperties;
class WriterPropertiesBuilder;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually have this class, lol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh, maybe this is for R language ( see r/src/parquet.cpp), revert it back

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 24, 2024
@mapleFU mapleFU changed the title GH-41608: [C++][Python][Parquet] Extends the append_key_value to parquet::arrow and PySDK GH-41608: [C++][Python][Parquet] Extends the append_key_value to parquet::arrow and PyArrow May 24, 2024
@mapleFU mapleFU requested review from AlenkaF and pitrou May 24, 2024 16:15
@mapleFU mapleFU marked this pull request as ready for review May 24, 2024 16:15
@mapleFU mapleFU requested a review from wgtmac as a code owner May 24, 2024 16:15
@mapleFU mapleFU force-pushed the parquet-export-kvmeta-to-arrow-and-python branch from 034d5f8 to b319fc8 Compare May 24, 2024 16:20
@mapleFU mapleFU changed the title GH-41608: [C++][Python][Parquet] Extends the append_key_value to parquet::arrow and PyArrow GH-41608: [C++][Python][Parquet] Extends the add_key_value to parquet::arrow and PyArrow May 24, 2024
@mapleFU mapleFU changed the title GH-41608: [C++][Python][Parquet] Extends the add_key_value to parquet::arrow and PyArrow GH-41608: [C++][Python] Extends the add_key_value to parquet::arrow and PyArrow May 24, 2024
@mapleFU mapleFU force-pushed the parquet-export-kvmeta-to-arrow-and-python branch from b319fc8 to 48d9cfb Compare May 24, 2024 17:03
///
/// WARNING: If `store_schema` is enabled, `ARROW:schema` would be stored
/// in the key-value metadata. Overwriting this key would result in
/// `store_schema` unusable during read.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// `store_schema` unusable during read.
/// undefined behavior of `ARROW:schema` during read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this cannot be used here? And if schema cannot being read, it would use "physical" schema?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use the loaded term "undefined behavior" here? I think the original wording is mostly fine, e.g.:

Suggested change
/// `store_schema` unusable during read.
/// `store_schema` being unusable during read.

@mapleFU
Copy link
Member Author

mapleFU commented May 27, 2024

I've no idea why R language CI failed


auto kv_meta = std::make_shared<KeyValueMetadata>();
kv_meta->Append("test_key_1", "test_value_1");
kv_meta->Append("test_key_2", "test_value_2_");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between test_value_2 and test_value_2_ is very obscure and easy to miss. Can you make this test clearer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  // <test_key_2, test_value_2_temp> would be overwritten later.
  kv_meta->Append("test_key_2", "test_value_2_temp");

Change to the code here

const auto& key_value_metadata = writer->metadata()->key_value_metadata();
ASSERT_TRUE(nullptr != key_value_metadata);

// Verify keys that were added before file writer was closed are present.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead read the file to make sure the metadata is here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I don't think they would be different. I'll verify them twice ( by read from writer->metadata()->key_value_metadata() and file)

///
/// WARNING: If `store_schema` is enabled, `ARROW:schema` would be stored
/// in the key-value metadata. Overwriting this key would result in
/// `store_schema` unusable during read.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use the loaded term "undefined behavior" here? I think the original wording is mostly fine, e.g.:

Suggested change
/// `store_schema` unusable during read.
/// `store_schema` being unusable during read.

@pitrou
Copy link
Member

pitrou commented May 28, 2024

@jorisvandenbossche Could you perhaps review the Cython/Python parts of this PR?

@mapleFU
Copy link
Member Author

mapleFU commented Jun 4, 2024

@AlenkaF @jorisvandenbossche @pitrou Would you mind take a look?

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this!

Python/Cython part LGTM 👍 with only one suggestion: I think the python test fits better in tests/parquet/test_parquet_writer.py.

@mapleFU
Copy link
Member Author

mapleFU commented Jun 4, 2024

Migrate to test_parquet_writer.py now. @pitrou would you mind revisit C++ part or merge this?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just two nits

COutputStream, CCacheOptions,
TimeUnit, CRecordBatchReader)
from pyarrow.lib cimport _Weakrefable
from pyarrow.lib cimport (_Weakrefable, pyarrow_unwrap_metadata, KeyValueMetadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the new imports are only required in _parquet.pyx AFAICT.

Parameters
----------
key_value_metadata : {Key, Value}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reuse the same conventions as in other docstrings:

Suggested change
key_value_metadata : {Key, Value}
key_value_metadata : dict
Keys and values must be string-like / coercible to bytes

@mapleFU
Copy link
Member Author

mapleFU commented Jun 4, 2024

@pitrou comment fixed

@pitrou
Copy link
Member

pitrou commented Jun 4, 2024

@mapleFU Feel free to merge if CI is fine.

@mapleFU
Copy link
Member Author

mapleFU commented Jun 4, 2024

CI failed is unrelated, merge

@mapleFU mapleFU merged commit d02a91b into apache:main Jun 4, 2024
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Jun 4, 2024
@mapleFU mapleFU deleted the parquet-export-kvmeta-to-arrow-and-python branch June 4, 2024 14:41
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d02a91b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants