-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-47338: [C++][Python] Remove deprecated string-based Parquet encryption methods #47339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
@pitrou Here is the follow-up on migrating Parquet encryption to There is no change to the Python API and only minimal glue-code for Python KMS client to work with the new C++ KMS client API. Rolling secure strings out to PyArrow is out of scope of this PR. |
| cdef extern from "Python.h": | ||
| # To let us get a PyObject* and avoid Cython auto-ref-counting | ||
| PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( | ||
| char *v, Py_ssize_t len) except NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised that this declaration is required. Cython already provides a declaration for this function and it would be surprised if it didn't have the right ref-counting behavior.
(also, by the way, if we require Cython 3.1 we will gain implicit std::string_view conversion to/from bytes objects)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing PyBytes_FromStringAndSize via from cpython.bytes cimport PyBytes_FromStringAndSize works.
I have copied this from io.pxi, maybe it is not needed there as well / any more:
Lines 43 to 46 in cdc1459
| cdef extern from "Python.h": | |
| # To let us get a PyObject* and avoid Cython auto-ref-counting | |
| PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( | |
| char *v, Py_ssize_t len) except NULL |
| cdef: | ||
| cpp_string_view view = key.as_view() | ||
| key_bytes = PyObject_to_object( | ||
| PyBytes_FromStringAndSizeNative(view.data(), view.size())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course this means the bytes object payload won't be cleared securely like the SecureString is.
We could try to tackle this as a followup issue/PR. bytes objects being immutable, this might be a bit delicate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, even the cryptography package doesn't do anything about it, so we may just have to live with it:
https://cryptography.io/en/latest/limitations/#secure-memory-wiping
However,
cryptographydoes not clear memory by default, as there is no way to clear immutable structures such as bytes. As a result,cryptography, like almost all software in Python is potentially vulnerable to this attack. The CERT secure coding guidelines assesses this issue as “Severity: medium, Likelihood: unlikely, Remediation Cost: expensive to repair” and we do not consider this a high risk for most users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I deliberately left this piece of work for a separate PR.
Happy to see this as "Likelihood: unlikely, Remediation Cost: expensive to repair" as well.
|
|
||
| cdef: | ||
| c_string cstr = tobytes(key) | ||
| shared_ptr[CSecureString] css = shared_ptr[CSecureString](new CSecureString(move(cstr))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to find if modern Cython allows us to make use of make_shared. At least it has a declaration for it:
https://github.com/cython/cython/blob/b862800bda6cc5f540d3b7961350cd2b6ad00dbd/Cython/Includes/libcpp/memory.pxd#L106-L107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works, this is much nicer, thanks!
| cdef void _cb_unwrap_key( | ||
| handler, const c_string& wrapped_key, | ||
| const c_string& master_key_identifier, c_string* out) except *: | ||
| const c_string& master_key_identifier, shared_ptr[CSecureString]* out) except *: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this is expecting a shared_ptr output, is it because of Cython's limited support for move-only types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it work with CSecureString* out and CSecureString& out. What do you prefer?
| key = handler.unwrap_key(wk_str, mkid_str) | ||
| out[0] = tobytes(key) | ||
|
|
||
| cdef: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a cdef: block is really required, especially as this is already a cdef function. Cython nowadays is more flexible than it used to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cython doesn't like the definition of a C_string:
Error compiling Cython file:
------------------------------------------------------------
...
cdef void _cb_unwrap_key(
handler, const c_string& wrapped_key,
const c_string& master_key_identifier, CSecureString& out) except *:
c_string cstr
^
------------------------------------------------------------
pyarrow/_parquet_encryption.pyx:323:13: Syntax error in simple statement list
Without the cdef of c_string cstr, it fails with
Error compiling Cython file:
------------------------------------------------------------
...
const c_string& master_key_identifier, CSecureString& out) except *:
mkid_str = frombytes(master_key_identifier)
wk_str = frombytes(wrapped_key)
key = handler.unwrap_key(wk_str, mkid_str)
cstr = tobytes(key)
out = CSecureString(move(cstr))
^
------------------------------------------------------------
pyarrow/_parquet_encryption.pyx:327:23: no suitable method found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, cdefs can be removed by using proper casting / c-type annotation: 79cd06e
|
Apart from the comments above, these changes look ok. |
|
All comments addressed. Diff is much cleaner now. |
|
Hmm, can you rebase/merge from main to fix most of the macOS CI builds? |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the most part, just one request
79cd06e to
fb5a9de
Compare
|
Comments addressed |
fb5a9de to
b125031
Compare
|
I rebased on main to get up-to-date CI results. |
|
Thanks a lot again @EnricoMi :) |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 403ba70. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
…encryption methods (apache#47339) ### Rationale for this change Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the `SecureString` based methods introduced in apache#46017 instead. To enforce this migration, deprecated methods are removed. ### What changes are included in this PR? Remove deprecated string-based methods. ### Are these changes tested? Existing Parquet encryption tests. ### Are there any user-facing changes? C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected. **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#47338 Authored-by: Enrico Minack <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the
SecureStringbased methods introduced in #46017 instead. To enforce this migration, deprecated methods are removed.What changes are included in this PR?
Remove deprecated string-based methods.
Are these changes tested?
Existing Parquet encryption tests.
Are there any user-facing changes?
C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected.
This PR includes breaking changes to public APIs.