Skip to content

Conversation

@EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Apr 3, 2025

Rationale for this change

Cryptographic keys must be kept private. Using the new arrow::util::SecureString is vital for storing secrets securely.

What changes are included in this PR?

Uses the arrow::util::SecureString introduced in #46626 for cryptographic keys throughout Parquet encryption.

Are these changes tested?

Unit tests.

Are there any user-facing changes?

APIs that hand over secrets to Arrow require the secret to be encapsulated in a SecureString.

This PR includes breaking changes to public APIs.

TODO:

  • provide instructions for migration

Supersedes #12890.

@github-actions
Copy link

github-actions bot commented Apr 3, 2025

⚠️ GitHub issue #31603 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

github-actions bot commented Apr 3, 2025

⚠️ GitHub issue #31603 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

github-actions bot commented Apr 3, 2025

⚠️ GitHub issue #31603 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Apr 3, 2025

@EnricoMi Do you want a preliminary review on this?

@pitrou
Copy link
Member

pitrou commented Apr 3, 2025

For the record, OpenSSL allows allocating memory from a secure heap, but that requires calling process-wide initialization functions. I've asked about this.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 3, 2025

@EnricoMi Do you want a preliminary review on this?

Yes, please! This is in a complete PoC state (Linux build and test is green).

Is this the right direction before I roll this out to Python?
Can I break the API or should I deprecate exiting methods and provide both, std::string and secure string methods?
Do you have some hints how I can fix the link issues for Windows? I presume I need to add the secure_string.cc file somewhere so Windows links to the object file.
Shouldn't it be sufficient to only use OPENSSL_cleanse and count on OpenSSL providing OS-dependent implementations?

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Apr 3, 2025

I agree, that OPENSSL_secure_malloc seems problematic. What if the application crashes (or receives kill -9) and CRYPTO_secure_malloc_done cannot be called. Are secrets then not protected? If secrets are protected without CRYPTO_secure_malloc_done, and multiple calls to OPENSSL_secure_malloc are idempotent, then this would be a good way to go.

@pitrou
Copy link
Member

pitrou commented Apr 3, 2025

Is this the right direction before I roll this out to Python?

Yes!

Can I break the API or should I deprecate exiting methods and provide both, std::string and secure string methods?

I would rather deprecate the existing methods.

Do you have some hints how I can fix the link issues for Windows?

You probably want to add a PARQUET_EXPORT annotation to the SecureString class?

Shouldn't it be sufficient to only use OPENSSL_cleanse and count on OpenSSL providing OS-dependent implementations?

Well, currently the OpenSSL function is very crude :)
https://github.com/openssl/openssl/blob/c66e00398c9feabc02ff6e678089a3dc95f985d2/crypto/mem_clr.c#L18-L25

If secrets are protected without CRYPTO_secure_malloc_done, and multiple calls to OPENSSL_secure_malloc are idempotent, then this would be a good way to go.

What I'm worried about is different libraries in the same process all trying to use the CRYPTO_secure_malloc APIs, or perhaps even one library calling CRYPTO_secure_malloc_done while another library is still using those API calls.

Comment on lines 38 to 47
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more stringent testing, I think we also want to examine the string area before it is cleared (or moved, etc.).

Something like (untested):

std::string_view StringArea(const std::string& string) {
  return std::string_view(string.data(), string.capacity());
}

void AssertSecurelyCleared(std::string_view area) {
  // the entire area is filled with zeros
  std::string zeros(area.size(), '\0');
  ASSERT_EQ(area, std::string_view(zeros));
}

void AssertSecurelyCleared(const std::string& string) {
  AssertSecurelyCleared(StringArea(string));
}

TEST(TestSecureString, SecureClearCheck) {
  // short string
  {
    std::string tiny("abc");
    auto old_area = StringArea(tiny);
    SecureString::SecureClear(tiny);
    AssertSecurelyCleared(tiny);
    AssertSecurelyCleared(old_area);
  }
  // etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But tiny.data() and old_area.data() point to the same address, what is the point testing old_area? Just in case pointers change? If they would change, then old_area would point to a freed memory area, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @EnricoMi , I realize I forgot to reply to this comment:

Just in case pointers change? If they would change, then old_area would point to a freed memory area, right?

Right, but the freed memory area was not necessarily cleaned up before being freed (unless the allocator was built in debug mode perhaps). That's why it's important to check that the old area doesn't contain the sensitive secrets anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also want a test that a SecureString was cleared on destruction ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would I do that, the memory where SecureString.as_span points to will be freed on deconstruction. How can I access that memory securely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standardese "true" way would be to template SecureString on an allocator so that you could intercept destruction of the bytes it stores. That's entirely too much here; we could just add a debug assertion inside ~SecureString (maybe only enabled by an env var or #if defined(ARROW_ASSERT_SECURE_STRING_ZEROED))

Copy link
Member

@bkietz bkietz Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, alternatively: since every STL I'm aware of implements std::string with an inline short string case, we could use:

  // Most std::string impls store their contents inline, so we can check
  // the object representation for a crib to ensure that destruction does
  // wipe all sensitive bytes.

  constexpr auto kCrib = "01CRIB01";
  alignas(SecureString) char object_repr[sizeof(SecureString)];
  std::string_view object_repr_view{object_repr, sizeof(object_repr)};

  SecureString* cribbed = new (&object_repr) SecureString(std::string(kCrib));
  bool crib_found_before = object_repr_view.find(kCrib) != object_repr_view.npos;
  cribbed->~SecureString(); // NOTE: do not allow this call to be skipped!
  bool crib_found_after = object_repr_view.find(kCrib) != object_repr_view.npos;

  if (crib_found_before) {
    ASSERT_FALSE(crib_found_after);
  }

I think that's well-defined: https://godbolt.org/z/31cY59EM3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would I do that, the memory where SecureString.as_span points to will be freed on deconstruction. How can I access that memory securely?

Depends what you call "securely", but in most cases (unless running under Valgrind or Address Sanitizer), you can still access a deallocated area. We should just try to do that here.

(this is morally the same as with the old area comment, by the way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is empty_string_ useful at all? It looks rather dubious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, method FileDecryptionProperties::column_key returns an empty string as the encryption key when the key id is not found. It should rather throw a meaningful key not found exception. I may fix that while I am touching this code anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a holdover from the first draft of encryption; IIRC column_key() returned references to strings at that time (so return ""; would fail to compile)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needed, improved name.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 3, 2025
@EnricoMi EnricoMi force-pushed the secure-string branch 2 times, most recently from 9848539 to b303afa Compare April 4, 2025 16:29
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a holdover from the first draft of encryption; IIRC column_key() returned references to strings at that time (so return ""; would fail to compile)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const seems warranted here

Suggested change
SecureString kFooterEncryptionKey_ = kFooterEncryptionKey;
const SecureString kFooterEncryptionKey_ = kFooterEncryptionKey;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't that hold for all k... members in this class (TestDecryptionConfiguration) and TestEncryptionConfiguration as well? Is it worth it given this are tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems questionable to me that the ultimate fallback should not be a simple loop

Suggested change
__asm__ __volatile__("" ::"r"(data) : "memory");
for (size_t i = 0; i < size; ++i) {
data[i] = 0;
}

I don't know how cross platform this __asm__ block is.

Also, since this code is borrowed from another project it needs to be mentioned in LICENSE.txt, for example

arrow/LICENSE.txt

Lines 1541 to 1543 in 7df396e

A function gettimeofday in utilities.cc is based on
http://www.google.com/codesearch/p?hl=en#dR3YEbitojA/COPYING&q=GetSystemTimeAsFileTime%20license:bsd

Copy link
Contributor Author

@EnricoMi EnricoMi Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

  volatile char *p = data;
  while (size--) *p++ = 0;

https://cplusplus.com/articles/ETqpX9L8/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given https://github.com/BLAKE2/libb2/blob/master/src/blake2-impl.h is CC0 Public Domain Dedication, and our code deviates a lot, you could say it is as much inspired by that implementation than it is by all those variations available on stackoverflow.

I don't think this qualifies as a copy or derived work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's in the public domain, then I imagine an inline comment referencing that should be sufficient. @pitrou do you agree? We do also account for some public domain artifacts in LICENSE.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a comment, yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou @bkietz

I have added more comments and links to prior art. I have also wrapped the __asm__ to enable it only for GCC and Clang: 510349c

We have two options for the fallback branch:

  1. the current memset implementation
  2. the while loop below

Both benefit from the __asm__ __volatile__ call.

  // Volatile pointer to the data is an attempt to avoid
  // that the compiler optimizes away the while loop.
  // https://cplusplus.com/articles/ETqpX9L8/
  // http://stackoverflow.com/a/13299459/965097
  // https://www.cryptologie.net/article/419/zeroing-memory-compiler-optimizations-and-memset_s/
  volatile char *p = data;
  while (size--) *p++ = 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the while loop is preferable; it's no more lines of code and is probably more obvious to readers.

Currently the memset_v block doesn't exclude the inline asm block

// https://github.com/openssl/openssl/blob/3423c30db3aa044f46e1f0270e2ecd899415bf5f/crypto/mem_clr.c#L22
static const volatile auto memset_v = &memset;
memset_v(data, 0, size);
# if defined(__GNUC__) || defined(__clang__)

I'd recommend rewriting to

#if (inline_asm_conditions)
// inline asm
#else
// while loop
#endif

the compiler could ordinarily be relied on to omit the redundant write, but since we're handling volatile memory we'll probably generate code which zeroes twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the memset_v as it is more performant. Either memset or loop will require verbose comment to explain the meaning of the volatile, so I think both are similar "obvious" to readers.

I don't understand your comment on the inline asm block. The asm block is a NOOP, it does not zero any memory. It is merely there to tell the compiler that it cannot optimize away any code that modified data before this line. So it is not either asm or loop, it is either (memset or loop) and asm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standardese "true" way would be to template SecureString on an allocator so that you could intercept destruction of the bytes it stores. That's entirely too much here; we could just add a debug assertion inside ~SecureString (maybe only enabled by an env var or #if defined(ARROW_ASSERT_SECURE_STRING_ZEROED))

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 4, 2025
Copy link
Member

@bkietz bkietz Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please move this to arrow/util/? I think it will be useful outside parquet

Copy link
Contributor Author

@EnricoMi EnricoMi May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have extracted the SecureString implementation into a separate PR: #46626. That targets arrow/util.

This PR will focus in migrating parquet encryption to that SecureString implementation.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 11, 2025
Copy link
Contributor Author

@EnricoMi EnricoMi Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looked suspicious, all other setter methods check if the member is unset.
Here, the check is always true as it checks the input.

This is a breaking change if user code calls this setter twice with non-empty keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've corrected a bug; the intention with these setter methods seems to be one-time setting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are still returning a reference to a SecureString. I tried to return some std::optional, to replace empty SecureString notation for "no key given" with std::nullopt, but this opened a can of worms. Not going down this rabbit whole in this PR.

At least the naming is improved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another use for a generic default reference: #46303 (comment)

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Jul 7, 2025

@pitrou I have removed the deprecated methods to "encourage" users to migrate to the new methods: 290449a

But this will require migrating the Python API as well:
https://github.com/apache/arrow/actions/runs/16126071702/job/45503157743?pr=46017#step:6:3956

I would like to move touching the Python API to a separate PR, which would contain the revert 290449a.

@pitrou
Copy link
Member

pitrou commented Jul 8, 2025

But this will require migrating the Python API as well: https://github.com/apache/arrow/actions/runs/16126071702/job/45503157743?pr=46017#step:6:3956

I would like to move touching the Python API to a separate PR, which would contain the revert 290449a.

Ok, that sounds good to me.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Jul 8, 2025

But this will require migrating the Python API as well: https://github.com/apache/arrow/actions/runs/16126071702/job/45503157743?pr=46017#step:6:3956
I would like to move touching the Python API to a separate PR, which would contain the revert 290449a.

Ok, that sounds good to me.

Alright, once you approve this PR, I will revert 290449a and we are ready to merge.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except two minor comments


// any empty SecureString key is interpreted as if no key is given
// this instance is used if a SecureString reference is returned
const ::arrow::util::SecureString no_key_ = ::arrow::util::SecureString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be an instance variable? It could just be a private constant inside encryption.cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 184afcf.

void PutKey(uint32_t key_id, ::arrow::util::SecureString key);

::arrow::util::SecureString GetKey(const std::string& key_id_string) override {
// key_id_string is string but for IntegerKeyIdRetriever it encodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth inlining this method, is it? Especially as it will usually be called through the vtable, anyway.

Copy link
Contributor Author

@EnricoMi EnricoMi Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz suggested the implementation given the signature is somewhat surprising: #46017 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I do find the implementation weird as well, and I'm not sure this class is actually useful, but hey :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed surprising...

@EnricoMi EnricoMi requested a review from bkietz July 9, 2025 12:51
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Jul 9, 2025

Maybe @bkietz and @wgtmac want to make a final pass over the PR.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took a quick pass on it. Thanks!

void PutKey(uint32_t key_id, ::arrow::util::SecureString key);

::arrow::util::SecureString GetKey(const std::string& key_id_string) override {
// key_id_string is string but for IntegerKeyIdRetriever it encodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed surprising...

@pitrou
Copy link
Member

pitrou commented Jul 10, 2025

@EnricoMi We have two +1s after a meticulous review pass, so perhaps you can go ahead and revert the compatibility-breaking changes so that we get a clean CI and merge.

@EnricoMi
Copy link
Contributor Author

✔️ All green.

@pitrou
Copy link
Member

pitrou commented Jul 15, 2025

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: 04b15e2

Submitted crossbow builds: ursacomputing/crossbow @ actions-12f2531100

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou merged commit 6e5d143 into apache:main Jul 15, 2025
36 of 37 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Jul 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants