-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-31603: [C++] Wrap Parquet encryption keys in SecureString #46017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
|
|
|
|
@EnricoMi Do you want a preliminary review on this? |
|
For the record, OpenSSL allows allocating memory from a secure heap, but that requires calling process-wide initialization functions. I've asked about this. |
Yes, please! This is in a complete PoC state (Linux build and test is green). Is this the right direction before I roll this out to Python? |
|
I agree, that |
Yes!
I would rather deprecate the existing methods.
You probably want to add a
Well, currently the OpenSSL function is very crude :)
What I'm worried about is different libraries in the same process all trying to use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For more stringent testing, I think we also want to examine the string area before it is cleared (or moved, etc.).
Something like (untested):
std::string_view StringArea(const std::string& string) {
return std::string_view(string.data(), string.capacity());
}
void AssertSecurelyCleared(std::string_view area) {
// the entire area is filled with zeros
std::string zeros(area.size(), '\0');
ASSERT_EQ(area, std::string_view(zeros));
}
void AssertSecurelyCleared(const std::string& string) {
AssertSecurelyCleared(StringArea(string));
}
TEST(TestSecureString, SecureClearCheck) {
// short string
{
std::string tiny("abc");
auto old_area = StringArea(tiny);
SecureString::SecureClear(tiny);
AssertSecurelyCleared(tiny);
AssertSecurelyCleared(old_area);
}
// etc.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But tiny.data() and old_area.data() point to the same address, what is the point testing old_area? Just in case pointers change? If they would change, then old_area would point to a freed memory area, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @EnricoMi , I realize I forgot to reply to this comment:
Just in case pointers change? If they would change, then
old_areawould point to a freed memory area, right?
Right, but the freed memory area was not necessarily cleaned up before being freed (unless the allocator was built in debug mode perhaps). That's why it's important to check that the old area doesn't contain the sensitive secrets anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also want a test that a SecureString was cleared on destruction ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would I do that, the memory where SecureString.as_span points to will be freed on deconstruction. How can I access that memory securely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standardese "true" way would be to template SecureString on an allocator so that you could intercept destruction of the bytes it stores. That's entirely too much here; we could just add a debug assertion inside ~SecureString (maybe only enabled by an env var or #if defined(ARROW_ASSERT_SECURE_STRING_ZEROED))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, alternatively: since every STL I'm aware of implements std::string with an inline short string case, we could use:
// Most std::string impls store their contents inline, so we can check
// the object representation for a crib to ensure that destruction does
// wipe all sensitive bytes.
constexpr auto kCrib = "01CRIB01";
alignas(SecureString) char object_repr[sizeof(SecureString)];
std::string_view object_repr_view{object_repr, sizeof(object_repr)};
SecureString* cribbed = new (&object_repr) SecureString(std::string(kCrib));
bool crib_found_before = object_repr_view.find(kCrib) != object_repr_view.npos;
cribbed->~SecureString(); // NOTE: do not allow this call to be skipped!
bool crib_found_after = object_repr_view.find(kCrib) != object_repr_view.npos;
if (crib_found_before) {
ASSERT_FALSE(crib_found_after);
}I think that's well-defined: https://godbolt.org/z/31cY59EM3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would I do that, the memory where
SecureString.as_spanpoints to will be freed on deconstruction. How can I access that memory securely?
Depends what you call "securely", but in most cases (unless running under Valgrind or Address Sanitizer), you can still access a deallocated area. We should just try to do that here.
(this is morally the same as with the old area comment, by the way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is empty_string_ useful at all? It looks rather dubious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, method FileDecryptionProperties::column_key returns an empty string as the encryption key when the key id is not found. It should rather throw a meaningful key not found exception. I may fix that while I am touching this code anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a holdover from the first draft of encryption; IIRC column_key() returned references to strings at that time (so return ""; would fail to compile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needed, improved name.
9848539 to
b303afa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a holdover from the first draft of encryption; IIRC column_key() returned references to strings at that time (so return ""; would fail to compile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const seems warranted here
| SecureString kFooterEncryptionKey_ = kFooterEncryptionKey; | |
| const SecureString kFooterEncryptionKey_ = kFooterEncryptionKey; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't that hold for all k... members in this class (TestDecryptionConfiguration) and TestEncryptionConfiguration as well? Is it worth it given this are tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems questionable to me that the ultimate fallback should not be a simple loop
| __asm__ __volatile__("" ::"r"(data) : "memory"); | |
| for (size_t i = 0; i < size; ++i) { | |
| data[i] = 0; | |
| } |
I don't know how cross platform this __asm__ block is.
Also, since this code is borrowed from another project it needs to be mentioned in LICENSE.txt, for example
Lines 1541 to 1543 in 7df396e
| A function gettimeofday in utilities.cc is based on | |
| http://www.google.com/codesearch/p?hl=en#dR3YEbitojA/COPYING&q=GetSystemTimeAsFileTime%20license:bsd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
volatile char *p = data;
while (size--) *p++ = 0;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given https://github.com/BLAKE2/libb2/blob/master/src/blake2-impl.h is CC0 Public Domain Dedication, and our code deviates a lot, you could say it is as much inspired by that implementation than it is by all those variations available on stackoverflow.
I don't think this qualifies as a copy or derived work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's in the public domain, then I imagine an inline comment referencing that should be sufficient. @pitrou do you agree? We do also account for some public domain artifacts in LICENSE.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a comment, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added more comments and links to prior art. I have also wrapped the __asm__ to enable it only for GCC and Clang: 510349c
We have two options for the fallback branch:
- the current
memsetimplementation - the
whileloop below
Both benefit from the __asm__ __volatile__ call.
// Volatile pointer to the data is an attempt to avoid
// that the compiler optimizes away the while loop.
// https://cplusplus.com/articles/ETqpX9L8/
// http://stackoverflow.com/a/13299459/965097
// https://www.cryptologie.net/article/419/zeroing-memory-compiler-optimizations-and-memset_s/
volatile char *p = data;
while (size--) *p++ = 0;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the while loop is preferable; it's no more lines of code and is probably more obvious to readers.
Currently the memset_v block doesn't exclude the inline asm block
arrow/cpp/src/arrow/util/secure_string.cc
Lines 123 to 127 in 219d207
| // https://github.com/openssl/openssl/blob/3423c30db3aa044f46e1f0270e2ecd899415bf5f/crypto/mem_clr.c#L22 | |
| static const volatile auto memset_v = &memset; | |
| memset_v(data, 0, size); | |
| # if defined(__GNUC__) || defined(__clang__) |
I'd recommend rewriting to
#if (inline_asm_conditions)
// inline asm
#else
// while loop
#endifthe compiler could ordinarily be relied on to omit the redundant write, but since we're handling volatile memory we'll probably generate code which zeroes twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the memset_v as it is more performant. Either memset or loop will require verbose comment to explain the meaning of the volatile, so I think both are similar "obvious" to readers.
I don't understand your comment on the inline asm block. The asm block is a NOOP, it does not zero any memory. It is merely there to tell the compiler that it cannot optimize away any code that modified data before this line. So it is not either asm or loop, it is either (memset or loop) and asm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standardese "true" way would be to template SecureString on an allocator so that you could intercept destruction of the bytes it stores. That's entirely too much here; we could just add a debug assertion inside ~SecureString (maybe only enabled by an env var or #if defined(ARROW_ASSERT_SECURE_STRING_ZEROED))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please move this to arrow/util/? I think it will be useful outside parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have extracted the SecureString implementation into a separate PR: #46626. That targets arrow/util.
This PR will focus in migrating parquet encryption to that SecureString implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looked suspicious, all other setter methods check if the member is unset.
Here, the check is always true as it checks the input.
This is a breaking change if user code calls this setter twice with non-empty keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you've corrected a bug; the intention with these setter methods seems to be one-time setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are still returning a reference to a SecureString. I tried to return some std::optional, to replace empty SecureString notation for "no key given" with std::nullopt, but this opened a can of worms. Not going down this rabbit whole in this PR.
At least the naming is improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another use for a generic default reference: #46303 (comment)
|
@pitrou I have removed the deprecated methods to "encourage" users to migrate to the new methods: 290449a But this will require migrating the Python API as well: I would like to move touching the Python API to a separate PR, which would contain the revert 290449a. |
Ok, that sounds good to me. |
Alright, once you approve this PR, I will revert 290449a and we are ready to merge. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except two minor comments
|
|
||
| // any empty SecureString key is interpreted as if no key is given | ||
| // this instance is used if a SecureString reference is returned | ||
| const ::arrow::util::SecureString no_key_ = ::arrow::util::SecureString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be an instance variable? It could just be a private constant inside encryption.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 184afcf.
| void PutKey(uint32_t key_id, ::arrow::util::SecureString key); | ||
|
|
||
| ::arrow::util::SecureString GetKey(const std::string& key_id_string) override { | ||
| // key_id_string is string but for IntegerKeyIdRetriever it encodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's worth inlining this method, is it? Especially as it will usually be called through the vtable, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz suggested the implementation given the signature is somewhat surprising: #46017 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I do find the implementation weird as well, and I'm not sure this class is actually useful, but hey :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed surprising...
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just took a quick pass on it. Thanks!
| void PutKey(uint32_t key_id, ::arrow::util::SecureString key); | ||
|
|
||
| ::arrow::util::SecureString GetKey(const std::string& key_id_string) override { | ||
| // key_id_string is string but for IntegerKeyIdRetriever it encodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed surprising...
|
@EnricoMi We have two +1s after a meticulous review pass, so perhaps you can go ahead and revert the compatibility-breaking changes so that we get a clean CI and merge. |
This reverts commit 290449a.
|
✔️ All green. |
|
@github-actions crossbow submit -g cpp |
|
Revision: 04b15e2 Submitted crossbow builds: ursacomputing/crossbow @ actions-12f2531100 |
Rationale for this change
Cryptographic keys must be kept private. Using the new
arrow::util::SecureStringis vital for storing secrets securely.What changes are included in this PR?
Uses the
arrow::util::SecureStringintroduced in #46626 for cryptographic keys throughout Parquet encryption.Are these changes tested?
Unit tests.
Are there any user-facing changes?
APIs that hand over secrets to Arrow require the secret to be encapsulated in a
SecureString.This PR includes breaking changes to public APIs.
TODO:
Supersedes #12890.