Skip to content

Conversation

@EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Aug 15, 2025

Rationale for this change

Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the SecureString based methods introduced in #46017 instead. To enforce this migration, deprecated methods are removed.

What changes are included in this PR?

Remove deprecated string-based methods.

Are these changes tested?

Existing Parquet encryption tests.

Are there any user-facing changes?

C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected.

This PR includes breaking changes to public APIs.

@github-actions
Copy link

⚠️ GitHub issue #47338 has been automatically assigned in GitHub to PR creator.

@EnricoMi
Copy link
Contributor Author

@pitrou Here is the follow-up on migrating Parquet encryption to SecureStrings. The first commit has been approved as part of #46017: #46017 (comment)

There is no change to the Python API and only minimal glue-code for Python KMS client to work with the new C++ KMS client API. Rolling secure strings out to PyArrow is out of scope of this PR.

Comment on lines 31 to 34
cdef extern from "Python.h":
# To let us get a PyObject* and avoid Cython auto-ref-counting
PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"(
char *v, Py_ssize_t len) except NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised that this declaration is required. Cython already provides a declaration for this function and it would be surprised if it didn't have the right ref-counting behavior.

(also, by the way, if we require Cython 3.1 we will gain implicit std::string_view conversion to/from bytes objects)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing PyBytes_FromStringAndSize via from cpython.bytes cimport PyBytes_FromStringAndSize works.

I have copied this from io.pxi, maybe it is not needed there as well / any more:

cdef extern from "Python.h":
# To let us get a PyObject* and avoid Cython auto-ref-counting
PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"(
char *v, Py_ssize_t len) except NULL

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 19, 2025
cdef:
cpp_string_view view = key.as_view()
key_bytes = PyObject_to_object(
PyBytes_FromStringAndSizeNative(view.data(), view.size()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this means the bytes object payload won't be cleared securely like the SecureString is.

We could try to tackle this as a followup issue/PR. bytes objects being immutable, this might be a bit delicate...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, even the cryptography package doesn't do anything about it, so we may just have to live with it:
https://cryptography.io/en/latest/limitations/#secure-memory-wiping

However, cryptography does not clear memory by default, as there is no way to clear immutable structures such as bytes. As a result, cryptography, like almost all software in Python is potentially vulnerable to this attack. The CERT secure coding guidelines assesses this issue as “Severity: medium, Likelihood: unlikely, Remediation Cost: expensive to repair” and we do not consider this a high risk for most users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I deliberately left this piece of work for a separate PR.

Happy to see this as "Likelihood: unlikely, Remediation Cost: expensive to repair" as well.


cdef:
c_string cstr = tobytes(key)
shared_ptr[CSecureString] css = shared_ptr[CSecureString](new CSecureString(move(cstr)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to find if modern Cython allows us to make use of make_shared. At least it has a declaration for it:
https://github.com/cython/cython/blob/b862800bda6cc5f540d3b7961350cd2b6ad00dbd/Cython/Includes/libcpp/memory.pxd#L106-L107

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works, this is much nicer, thanks!

cdef void _cb_unwrap_key(
handler, const c_string& wrapped_key,
const c_string& master_key_identifier, c_string* out) except *:
const c_string& master_key_identifier, shared_ptr[CSecureString]* out) except *:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this is expecting a shared_ptr output, is it because of Cython's limited support for move-only types?

Copy link
Contributor Author

@EnricoMi EnricoMi Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it work with CSecureString* out and CSecureString& out. What do you prefer?

key = handler.unwrap_key(wk_str, mkid_str)
out[0] = tobytes(key)

cdef:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a cdef: block is really required, especially as this is already a cdef function. Cython nowadays is more flexible than it used to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython doesn't like the definition of a C_string:

Error compiling Cython file:
------------------------------------------------------------
...


cdef void _cb_unwrap_key(
        handler, const c_string& wrapped_key,
        const c_string& master_key_identifier, CSecureString& out) except *:
    c_string cstr
             ^
------------------------------------------------------------

pyarrow/_parquet_encryption.pyx:323:13: Syntax error in simple statement list

Without the cdef of c_string cstr, it fails with

Error compiling Cython file:
------------------------------------------------------------
...
        const c_string& master_key_identifier, CSecureString& out) except *:
    mkid_str = frombytes(master_key_identifier)
    wk_str = frombytes(wrapped_key)
    key = handler.unwrap_key(wk_str, mkid_str)
    cstr = tobytes(key)
    out = CSecureString(move(cstr))
                       ^
------------------------------------------------------------

pyarrow/_parquet_encryption.pyx:327:23: no suitable method found

Copy link
Contributor Author

@EnricoMi EnricoMi Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, cdefs can be removed by using proper casting / c-type annotation: 79cd06e

@pitrou
Copy link
Member

pitrou commented Aug 19, 2025

Apart from the comments above, these changes look ok.

@EnricoMi
Copy link
Contributor Author

All comments addressed. Diff is much cleaner now.

@pitrou
Copy link
Member

pitrou commented Aug 26, 2025

Hmm, can you rebase/merge from main to fix most of the macOS CI builds?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the most part, just one request

@EnricoMi EnricoMi force-pushed the secure-string-python branch from 79cd06e to fb5a9de Compare August 27, 2025 11:38
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Sep 4, 2025

Comments addressed

@pitrou pitrou force-pushed the secure-string-python branch from fb5a9de to b125031 Compare September 4, 2025 12:15
@pitrou
Copy link
Member

pitrou commented Sep 4, 2025

I rebased on main to get up-to-date CI results.

@pitrou pitrou merged commit 403ba70 into apache:main Sep 4, 2025
34 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Sep 4, 2025
@pitrou
Copy link
Member

pitrou commented Sep 4, 2025

Thanks a lot again @EnricoMi :)

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 403ba70.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Oct 15, 2025
…encryption methods (apache#47339)

### Rationale for this change
Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the `SecureString` based methods introduced in apache#46017 instead. To enforce this migration, deprecated methods are removed.

### What changes are included in this PR?
Remove deprecated string-based methods.

### Are these changes tested?
Existing Parquet encryption tests.

### Are there any user-facing changes?
C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected.

**This PR includes breaking changes to public APIs.**
* GitHub Issue: apache#47338

Authored-by: Enrico Minack <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants