GH-43057: [C++] Thread-safe AesEncryptor / AesDecryptor #44990
Conversation
The test failures look like they might be caused by openssl/openssl#21955. The failing ASAN test run is using OpenSSL 3.0.2, but builds with other versions are passing; e.g. the Ubuntu 20.04 build uses OpenSSL 1.1.1f and the Conda build has OpenSSL 3.3.2. The fix for this was backported to the 3.0 branch and released in 3.0.13 (openssl/openssl#23102), but Ubuntu 22.04 still uses 3.0.2. If there's not an easy way around this, maybe we need a slower code path for 3.0.2 that creates a new context each time?

Let's use the "slower" code path everywhere? I'm skeptical it will be that slow...
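To make the "new context each time" idea concrete, here is a minimal sketch of such a code path for CTR mode using OpenSSL's EVP API. The function name `CtrEncryptOnce` and its signature are hypothetical illustrations, not Arrow's actual interface:

```cpp
#include <openssl/evp.h>

#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical free function sketching the "new context per call" approach:
// every invocation allocates and initializes its own EVP_CIPHER_CTX, so no
// cipher state is reused across calls or shared between threads.
std::vector<unsigned char> CtrEncryptOnce(const unsigned char* key, int key_len,
                                          const unsigned char* iv,
                                          const unsigned char* plaintext,
                                          int plaintext_len) {
  using CtxPtr = std::unique_ptr<EVP_CIPHER_CTX, decltype(&EVP_CIPHER_CTX_free)>;
  CtxPtr ctx(EVP_CIPHER_CTX_new(), &EVP_CIPHER_CTX_free);
  if (!ctx) throw std::runtime_error("EVP_CIPHER_CTX_new failed");

  const EVP_CIPHER* cipher = key_len == 16   ? EVP_aes_128_ctr()
                             : key_len == 24 ? EVP_aes_192_ctr()
                                             : EVP_aes_256_ctr();
  if (EVP_EncryptInit_ex(ctx.get(), cipher, nullptr, key, iv) != 1) {
    throw std::runtime_error("EVP_EncryptInit_ex failed");
  }

  std::vector<unsigned char> ciphertext(plaintext_len + EVP_MAX_BLOCK_LENGTH);
  int len = 0;
  int total = 0;
  if (EVP_EncryptUpdate(ctx.get(), ciphertext.data(), &len, plaintext, plaintext_len) != 1) {
    throw std::runtime_error("EVP_EncryptUpdate failed");
  }
  total += len;
  if (EVP_EncryptFinal_ex(ctx.get(), ciphertext.data() + total, &len) != 1) {
    throw std::runtime_error("EVP_EncryptFinal_ex failed");
  }
  total += len;
  ciphertext.resize(total);
  return ciphertext;  // the context is freed automatically when ctx goes out of scope
}
```

Allocating an `EVP_CIPHER_CTX` per call trades a small allocation and initialization cost for the guarantee that no cipher state is ever shared between threads.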
pitrou left a comment
Thanks a lot for stepping in and submitting this @EnricoMi. You'll find a number of comments below.
Force-pushed from a2e3992 to 108c8cf.
I have addressed all comments except for the test-related ones. Will go over them tomorrow. Thanks for your input, that is highly appreciated!
I tested concurrent scans of a larger dataset with uniform encryption, and this change doesn't completely fix that scenario:

```diff
--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -289,6 +289,8 @@ TEST_F(DatasetEncryptionTest, ReadSingleFile) {
 // processing encrypted datasets over 2^15 rows in multi-threaded mode.
 class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
  public:
+  LargeRowEncryptionTest() : DatasetEncryptionTestBase(true) {}
+
   // The dataset is partitioned using a Hive partitioning scheme.
   void PrepareTableAndPartitioning() override {
     // Specifically chosen to be greater than batch size for triggering prefetch.
@@ -307,7 +309,7 @@ class LargeRowEncryptionTest : public DatasetEncryptionTestBase {
 // Test for writing and reading encrypted dataset with large row count.
 TEST_F(LargeRowEncryptionTest, ReadEncryptLargeRows) {
-  ASSERT_NO_FATAL_FAILURE(TestScanDataset());
+  ASSERT_NO_FATAL_FAILURE(TestScanDataset(true));
 }
 }  // namespace dataset
```

This gives errors like:

I believe this is due to the decryptor AAD being mutable. It's updated for each data page read, so concurrent use will result in incorrect AADs being used in some threads. Rather than mutating the decryptor state, it might be possible to refactor this so that the AAD values are passed to the decrypt method as a parameter. I think this should probably be addressed as a separate PR though, as the changes should be orthogonal.
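For illustration only, a sketch of the interface change suggested above; the class and method names are hypothetical and not Arrow's actual API, but they show the difference between keeping the AAD as mutable decryptor state and passing it as a per-call argument:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Problematic shape: the AAD lives on the shared decryptor and is overwritten
// for every data page, so concurrent readers can pick up another thread's AAD.
class StatefulDecryptor {
 public:
  void UpdateAad(std::string aad) { aad_ = std::move(aad); }  // racy when shared
  std::vector<uint8_t> Decrypt(const uint8_t* ciphertext, int ciphertext_len);

 private:
  std::string aad_;  // mutable shared state
};

// Suggested shape: the decryptor keeps only immutable configuration (key,
// algorithm), and every call supplies the AAD it needs, so concurrent calls
// cannot interfere with each other.
class StatelessDecryptor {
 public:
  std::vector<uint8_t> Decrypt(const uint8_t* ciphertext, int ciphertext_len,
                               const std::string& aad) const;
};
```

With the second shape, a single decryptor instance can be shared across threads because each `Decrypt` call carries everything that varies per data page.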
Confirmed in ec9b054dcf2b10bbed8c2c38b93b88a17d5d3d06. Multi-threaded uniform encryption read fails.

Force-pushed from 33ff64b to 6508c5a.
Attempt to fix the multi-threaded AAD update: d77a081a49b3436fd0bc16ce3d10792b7222e026

Any updates?
Force-pushed from 4fc5b42 to b652e06.

Force-pushed from b652e06 to fbe3a1c.
pitrou left a comment
@EnricoMi Sorry for the delay and thanks a lot for persevering on this. I think the fix is still not 100% correct, see comments below. But we're nearing a solution.
Also, I can help on this PR if you want.
Force-pushed from 85c55ba to 4d388fa.
@pitrou suggestions applied and rebased

@github-actions crossbow submit -g cpp -g python

@github-actions crossbow submit cp313

Revision: a587553. Submitted crossbow builds: ursacomputing/crossbow @ actions-05dac70ebf

Revision: a587553. Submitted crossbow builds: ursacomputing/crossbow @ actions-34c96e7a1d

@github-actions crossbow submit -g cpp
pitrou left a comment
LGTM. Thanks a lot for this @EnricoMi!

Revision: 9961069. Submitted crossbow builds: ursacomputing/crossbow @ actions-ebb6807bc5
@pitrou thank you for your time! Will this make it into the 20.0.0 release?

@EnricoMi It should, I'll wait for CI and merge if green.

I think release 20.0.0 has been feature-frozen, and the maint-20.0.0 branch was created earlier today?
Can we consider this a bug fix and not a feature, so it could still make it into the 20.0.0 maint branch? Or is "feature freeze" more of a general "code freeze"?

This is definitely a bug fix (it fixes an actual bug), even though it's quite elaborate and goes deeper than simply preventing nasty behavior.

Well, I mean that with the feature freeze done (i.e. the maintenance branch has been cut), this PR won't end up in 20.0.0 automatically. We may need to ask the release managers for help. And yes, a bug fix still has a chance of being back-ported (depending on the severity, I guess?).

The release manager (@assignUser) has been notified and is happy to add it to the maintenance branch, as per the Zulip conversation.

CI failures are unrelated.

Awesome, many thanks! Happy to help with the back-porting; please let me know if I can!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6452194. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 48 possible false positives for unstable benchmarks that are known to sometimes produce them. |
GH-43057: [C++] Thread-safe AesEncryptor / AesDecryptor (#44990)

### Rationale for this change

OpenSSL encryption / decryption is wrapped by AesEncryptor / AesDecryptor, which is used by multiple threads of a single scanner or by multiple concurrent scanners when scanning a dataset. A thread may call `WipeOut` while other threads still use the instance.

### What changes are included in this PR?

- Remove the `WipeOut` methods and related data structures entirely.
- Each call into `CtrEncrypt` / `CtrDecrypt` and `GcmEncrypt` / `GcmDecrypt` uses its own `EVP_CIPHER_CTX` instance, making this thread-safe.

After fixing this `"AesDecryptor was wiped out"` issue, two other segmentation faults surfaced: GH-44988. This has also been addressed here, as it can only be exposed after fixing the wipe-out issue.

Fixes GH-43057.
Fixes GH-44852.
Fixes GH-44988.

### Are these changes tested?

A unit test that scans a dataset concurrently reproduced the initial issue in 30% of the test runs.

### Are there any user-facing changes?

No.

* GitHub Issue: apache#43057

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
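As a hedged illustration of the per-call `EVP_CIPHER_CTX` approach on the authenticated (GCM) decryption path, the sketch below allocates its own context, feeds the AAD supplied by the caller, and verifies the tag. The function name, parameter layout (12-byte nonce, 16-byte tag passed separately), and error handling are assumptions for this example, not the actual `GcmDecrypt` implementation:

```cpp
#include <openssl/evp.h>

#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical helper: AES-256-GCM decryption where the caller passes the AAD
// explicitly and the EVP_CIPHER_CTX lives only for the duration of the call,
// so nothing is shared between threads.
std::vector<unsigned char> GcmDecryptOnce(const unsigned char* key,    // 32 bytes
                                          const unsigned char* nonce,  // 12 bytes
                                          const unsigned char* ciphertext, int ciphertext_len,
                                          const unsigned char* tag,    // 16 bytes
                                          const unsigned char* aad, int aad_len) {
  using CtxPtr = std::unique_ptr<EVP_CIPHER_CTX, decltype(&EVP_CIPHER_CTX_free)>;
  CtxPtr ctx(EVP_CIPHER_CTX_new(), &EVP_CIPHER_CTX_free);
  if (!ctx) throw std::runtime_error("EVP_CIPHER_CTX_new failed");

  if (EVP_DecryptInit_ex(ctx.get(), EVP_aes_256_gcm(), nullptr, key, nonce) != 1) {
    throw std::runtime_error("EVP_DecryptInit_ex failed");
  }

  int len = 0;
  // Feed the per-page AAD supplied by the caller (output buffer is null for AAD).
  if (aad_len > 0 && EVP_DecryptUpdate(ctx.get(), nullptr, &len, aad, aad_len) != 1) {
    throw std::runtime_error("EVP_DecryptUpdate(AAD) failed");
  }

  std::vector<unsigned char> plaintext(ciphertext_len);
  int total = 0;
  if (EVP_DecryptUpdate(ctx.get(), plaintext.data(), &len, ciphertext, ciphertext_len) != 1) {
    throw std::runtime_error("EVP_DecryptUpdate failed");
  }
  total += len;

  // Set the expected tag, then finalize; failure here means authentication failed.
  if (EVP_CIPHER_CTX_ctrl(ctx.get(), EVP_CTRL_GCM_SET_TAG, 16,
                          const_cast<unsigned char*>(tag)) != 1) {
    throw std::runtime_error("EVP_CTRL_GCM_SET_TAG failed");
  }
  if (EVP_DecryptFinal_ex(ctx.get(), plaintext.data() + total, &len) != 1) {
    throw std::runtime_error("GCM tag mismatch: ciphertext not authenticated");
  }
  total += len;
  plaintext.resize(total);
  return plaintext;
}
```

Because both the cipher context and the AAD are scoped to the call, a decryptor built this way has no mutable state left to race on.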