GH-31603: [C++] Add SecureString implementation to arrow/util/ #46626

Conversation
pitrou left a comment
Thanks for submitting this @EnricoMi. This looks generally sound, some assorted comments below.
@pitrou thanks for the thorough review. Any suggestions how I could get all those branches covered in CI?

We have many CI platforms already, including macOS, Windows and a number of different Linux flavors. They will not cover everything, but we may deem them sufficient for now.
Force-pushed b4d75af to 973b233
cpp/src/arrow/util/secure_string.cc (outdated)

```cpp
// - requires secure cleaning the local buffer
// If secret is longer, moves the pointer to secret_, resets to 0, uses local buffer
// - does not require cleaning anything
secret_ = std::move(secret);
```
@pitrou This requires the std::string implementation to move (reuse) the string memory (not the local buffer) of long non-local strings. If the implementation copied the string memory and then freed it, we would leak the secret.
Is it safe to assume `secret_ = std::move(secret)` reuses the non-local memory in all implementations?
I would expect it from any decent implementation. Besides, the move constructor is specified as noexcept, which is a hint that it should not dynamically allocate memory.
Perhaps we can actually add an assertion for it?
Added assertion and more comments in 9ee3e2c. Please advise if ARROW_CHECK is the right assertion here. Does the noexcept annotation still apply for those constructors / operators that use secure_move? Or does ARROW_CHECK only log and never throw?
ARROW_CHECK aborts on error, so yes, this looks ok to me.
Do we want Arrow / Parquet to die on this check? Upgrading Arrow might then break for some users. Though we would not expect any std::string implementation to fail this check.
I think it's ok. We expect it not to happen, but we would like to know if some std::string implementation fails the expectation.
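To make the scheme concrete, here is a minimal sketch of such a move-with-assertion. The standalone `SecureClear` helper and the pointer-based `IsShortString` heuristic are illustrations for this discussion, not Arrow's actual internals; `ARROW_CHECK` is the abort-on-failure check from `arrow/util/logging.h` mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

#include "arrow/util/logging.h"  // ARROW_CHECK

namespace {

// Assumed helper: zero the buffer through a volatile pointer so the compiler
// cannot elide the writes as a dead store.
void SecureClear(uint8_t* data, size_t size) {
  volatile uint8_t* ptr = data;
  for (size_t i = 0; i < size; ++i) ptr[i] = 0;
}

// Assumed SSO heuristic: with short-string optimization, the character
// buffer lives inside the std::string object itself.
bool IsShortString(const std::string& s) {
  const char* data = s.data();
  const char* object = reinterpret_cast<const char*>(&s);
  return data >= object && data < object + sizeof(std::string);
}

// Move `src` into `dst` without leaking the secret.
void SecureMove(std::string& src, std::string& dst) {
  if (IsShortString(src)) {
    // The move copies the local buffer, so wipe the source afterwards.
    dst = std::move(src);
    SecureClear(reinterpret_cast<uint8_t*>(src.data()), src.capacity());
    src.clear();
  } else {
    const char* heap_data = src.data();
    dst = std::move(src);
    // Aborts if this std::string implementation copied instead of moving.
    ARROW_CHECK(dst.data() == heap_data) << "move did not reuse heap buffer";
  }
}

}  // namespace
```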
Never mind, I have found it.
Force-pushed f72522e to 8d9c4f9
That's neat, I didn't know about it. Thank you!

There are some CI failures. Example here:

@pitrou all issues fixed, the remaining CI issues seem unrelated.
cpp/src/arrow/util/secure_string.cc (outdated)

```cpp
  secret->clear();
}

inline void SecureString::SecureClear(uint8_t* data, size_t size) {
```
@pitrou Would you prefer moving this up right below secure_move, as this is the essence of the implementation?
@EnricoMi That's a good idea, yes.
moved
pitrou left a comment
Thanks a lot, this is really looking good. A couple more (minor) suggestions.
cpp/src/arrow/util/secure_string.cc (outdated)

```cpp
///
/// This condition is checked by secure_move.

void secure_move(std::string& string, std::string& dst) {
```
Nit: rename this to SecureMove and put it in the anonymous namespace so that it's not exposed as a public symbol.
done
```cpp
#define COMPARE(val1, val2) \
  ::testing::internal::EqHelper::Compare(#val1, #val2, val1, val2)

::testing::AssertionResult AssertSecurelyCleared(const std::string_view& area) {
```
I think we shouldn't call it "AssertSomething" since it doesn't actually assert. Perhaps SecurelyCleared?
Renamed to IsSecurelyCleared, then these statements read naturally:

```cpp
ASSERT_TRUE(IsSecurelyCleared(string))
```
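For context, here is a sketch of what the renamed helper can look like using GTest's `AssertionResult`; the all-zero check is an assumption about what "securely cleared" verifies here, not the test's exact logic.

```cpp
#include <cstddef>
#include <string_view>

#include <gtest/gtest.h>

// Returning an AssertionResult instead of asserting lets the helper compose
// with ASSERT_TRUE / EXPECT_TRUE and produce a readable failure message.
::testing::AssertionResult IsSecurelyCleared(const std::string_view& area) {
  for (size_t i = 0; i < area.size(); ++i) {
    if (area[i] != '\0') {
      return ::testing::AssertionFailure()
             << "byte at offset " << i << " was not cleared";
    }
  }
  return ::testing::AssertionSuccess();
}
```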
Co-authored-by: Antoine Pitrou <[email protected]>
- Improve imports and definition export
- Add lines between definitions
- Improve comments
- Remove redundant ::arrow::util:: when using span
- Remove test setup assertions
- Reuse view in test

- Rename secure_move to SecureMove
- Move SecureMove into anonymous namespace
- Move SecureClear up in source file
- Rename AssertSecurelyCleared to IsSecurelyCleared
- Remove std::string_view(std::string) from tests where not needed
Force-pushed 0ec848c to 99cd8c3
pitrou left a comment
This looks great now. Thanks a lot @EnricoMi!
@github-actions crossbow submit -g cpp

@github-actions crossbow submit wheelcp313*
The Construct test was relying on short string optimization and failed on the Emscripten build. I've pushed a fix and made it exercise both short and long strings.
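A sketch of the shape such a test could take; the lengths and the `size()` accessor are illustrative assumptions, and the key point is covering both an SSO-sized and a heap-allocated secret so the test does not depend on any particular short-string-optimization threshold.

```cpp
#include <string>
#include <utility>

#include <gtest/gtest.h>

#include "arrow/util/secure_string.h"

TEST(SecureStringTest, ConstructShortAndLong) {
  // 5 bytes should fit in the SSO buffer; 100 bytes should be heap-allocated.
  for (size_t length : {size_t{5}, size_t{100}}) {
    std::string secret(length, 'x');
    arrow::util::SecureString secure(std::move(secret));
    // The source string no longer exposes the secret after construction.
    EXPECT_TRUE(secret.empty());
    EXPECT_EQ(secure.size(), length);  // hypothetical accessor
  }
}
```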
@github-actions crossbow submit -g cpp

Revision: 408572c
Submitted crossbow builds: ursacomputing/crossbow @ actions-7d467026d1
The remaining CI failures are unrelated, I'll merge.

@pitrou thank you very much, let's get this used in Arrow now!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6f43aea. There were 70 benchmark results with an error.

There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-31603: [C++] Add SecureString implementation to arrow/util/ (apache#46626)

### Rationale for this change

Arrow deals with secrets like encryption / decryption keys which must be kept private. One way of leaking such secrets is through memory allocation, where another process allocates memory that previously held the secret, because that memory was not cleared before being freed.

### What changes are included in this PR?

Uses various implementations of securely clearing memory, notably:
- `SecureZeroMemory` (Windows)
- `memset_s` (STDC)
- `OPENSSL_cleanse` (OpenSSL >= 3)
- `explicit_bzero` (glibc 2.25+)
- volatile `memset` (fallback)

### Are these changes tested?

Unit tests.

### Are there any user-facing changes?

This only adds the `SecureString` class and tests. Using this new infrastructure is done in follow-up pull requests.

* GitHub Issue: apache#31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
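A hedged sketch of that selection order; the feature-test macros and the `ARROW_WITH_OPENSSL` flag below are assumptions, and Arrow's actual configure-time detection may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
// Platform headers (<windows.h>, <openssl/crypto.h>) omitted for brevity.

void SecureClear(uint8_t* data, size_t size) {
#if defined(_WIN32)
  SecureZeroMemory(data, size);              // Windows
#elif defined(ARROW_WITH_OPENSSL)            // assumed flag, OpenSSL >= 3
  OPENSSL_cleanse(data, size);
#elif defined(__STDC_LIB_EXT1__)
  memset_s(data, size, 0, size);             // C11 Annex K
#elif defined(__GLIBC__) && (__GLIBC__ > 2 || __GLIBC_MINOR__ >= 25)
  explicit_bzero(data, size);                // glibc 2.25+
#else
  // Fallback: write through a volatile pointer so the compiler cannot
  // optimize the zeroing away as a dead store.
  volatile uint8_t* ptr = data;
  for (size_t i = 0; i < size; ++i) ptr[i] = 0;
#endif
}
```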
GH-31603: [C++] Wrap Parquet encryption keys in SecureString (#46017)

### Rationale for this change

Cryptographic keys must be kept private. Using the new `arrow::util::SecureString` is vital for storing secrets securely.

### What changes are included in this PR?

Uses the `arrow::util::SecureString` introduced in #46626 for cryptographic keys throughout Parquet encryption.

### Are these changes tested?

Unit tests.

### Are there any user-facing changes?

APIs that hand over secrets to Arrow require the secret to be encapsulated in a `SecureString`. **This PR includes breaking changes to public APIs.**

TODO:
- provide instructions for migration

Supersedes #12890.

* GitHub Issue: #31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
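A hedged usage sketch of handing a key to Arrow after this change, assuming only the move-constructing behavior discussed in this PR; the example key value is illustrative.

```cpp
#include <string>
#include <utility>

#include "arrow/util/secure_string.h"

int main() {
  std::string raw_key = "0123456789012345";  // example 128-bit key
  // The constructor consumes the source string and securely wipes its
  // buffer, so no plain copy of the key lingers in freed memory.
  arrow::util::SecureString key(std::move(raw_key));
  // `key` is then passed to Parquet encryption APIs instead of std::string.
  return 0;
}
```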
Other commits included in this branch update (titles only):

* GH-46745: [C++] Update bundled Boost to 1.88.0 and Apache Thrift to 0.22.0 (#46912)
* GH-46962: [C++][Parquet] Generic xsimd function and dynamic dispatch for Byte Stream Split (#46963)
* GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE (#44739)
* GH-47012: [C++][Parquet] Reserve values correctly when reading BYTE_ARRAY and FLBA (#47013)
* GH-47037: [CI][C++] Fix Fedora 39 CI jobs (#47038)
* GH-45639: [C++][Statistics] Add support for ARROW:average_byte_width:{exact,approximate} (#46385)
* GH-46938: [C++] Enhance arrow::ChunkedArray::Equals to support floating-point comparison when values share the same memory (#47044)
* GH-46989: [CI][R] Use Ubuntu 20.04 instead of OpenSUSE for R 4.1 (#46990)
* GH-47033: [C++][Compute] Never use custom gtest main with MSVC (#47049)
* GH-38211: [MATLAB] Add support for creating an empty `arrow.tabular.RecordBatch` by calling `arrow.recordBatch` with no input arguments (#47060)
* GH-47061: [Release] Fix wrong variable name for signing (#47062)
* GH-47063: [Release] Define missing RELEASE_TARBALL (#47064)
* GH-47047: [CI][C++] Use Google Cloud Storage Testbench v0.55.0 (#47048)
* GH-47065: [Release] Fix timeout key in verify_rc.yml (#47066)
* GH-47067: [Release] Fix wrong GitHub Actions context in verify_rc.yml (#47068)
* GH-47069: [Release] Add missing "needs: target" (#47070)
* GH-47071: [Release] Dereference all hard links in source archive (#47072)
* GH-47074: [Release] Use reproducible mtime for csharp/ in source archive (#47076)
* GH-47078: [Release] Ensure using cloned apache/arrow for reproducible check (#47079)
* GH-47045: [CI][C++] Use Fedora 42 instead of 39 (#47046)
* GH-47005: [C++] Disable exporting CMake packages (#47006)
* GH-47081: [Release] Verify reproducible source build explicitly (#47082)
* GH-47092: [Release] Fix errors in APT/Yum previous version verification (#47093)
* GH-41110: [C#] Handle empty stream in ArrowStreamReaderImplementation (#47098)
* MINOR: [C#] Bump BenchmarkDotNet and 6 others (#46828)
* GH-38532: [MATLAB] Add a `validate` method to all `arrow.array.Array` classes (#47059)
* GH-47027: [C++][Parquet] Fix repeated column pages not being written when reaching page size limit (#47032)
* GH-47088: [CI][Dev] Fix shellcheck errors in the ci/scripts/integration_arrow.sh (#47089)
* GH-46629: [Python] Add options to DatasetFactory.inspect (#46961)
* GH-31603: [C++] Wrap Parquet encryption keys in SecureString (#46017)
* GH-46728: [Python] Skip test_gdb.py tests if PyArrow wasn't built debug (#46755)
* GH-46860: [C++] Making HalfFloatBuilder accept Float16 as well as uint16_t (#46981)
* GH-38572: [Docs][MATLAB] Update `arrow/matlab/README.md` with the latest change (#47109)
* GH-38213: [MATLAB] Create a superclass for tabular type MATLAB tests (i.e. for `Table` and `RecordBatch`) (#47107)
* GH-38422: [MATLAB] Add `NumNulls` property to `arrow.array.Array` class (#47116)
* GH-46272: [C++] Build Arrow libraries with `-Wmissing-definitions` on gcc (#47042)
* GH-47040: [C++] Refine reset of Span to be reusable (#47004)
* GH-47120: [R] Update NEWS for 21.0.0 (#47121)
* MINOR: [Docs][R] Add link to book to R README (#47119)
* MINOR: [R] Add Bryce to authors list (#47122)
Nope Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]> * MINOR: [Release] Minor updates to post-release scripts (#47140) ### Rationale for this change Just doing some maintenance on post-release scripts. ### What changes are included in this PR? Updates two of our release scripts to make them work correctly. 1. post-13-homebrew.sh: Homebrew changed their default branch to main recently, see https://github.com/Homebrew/homebrew-core/pull/228218. 2. post-15-conan.sh: Makes the sed usage portable so it runs equally well on macOS. ### Are these changes tested? Yes. I ran them myself during the 21.0.0 post-release tasks. ### Are there any user-facing changes? No. Lead-authored-by: Bryce Mecum <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]> * GH-46374: [Python][Doc] Improve docs to specify that source argument on parquet.read_table can also be a list of strings (#47142) See #46374 ### What changes are included in this PR? The docstring for `parquet.read_table` doesn't specify that the source can be a list of strings: This is regarding the change for _read_table_docstring which source can be a list of string as well ### Are there any user-facing changes? Only docs changed. * GitHub Issue: #46374 Authored-by: Soroush Rasti <[email protected]> Signed-off-by: Rok Mihevc <[email protected]> * GH-47153: [Docs][C++] Update cmake target table in build_system.rst with newly added targets (#47154) ### Rationale for this change We were missing documentation for some of the newer CMake packages and targets we've split out. This adds documentation for those (acero, compute). ### What changes are included in this PR? - Updates the table in build_system.rst to include ArrowAcero and ArrowCompute ### Are these changes tested? No, will render in CI though for others to see. ### Are there any user-facing changes? No. * GitHub Issue: #47153 Authored-by: Bryce Mecum <[email protected]> Signed-off-by: Bryce Mecum <[email protected]> * MINOR: [Release] Update versions for 22.0.0-SNAPSHOT * MINOR: [Release] Update .deb package names for 22.0.0 * MINOR: [Release] Update .deb/.rpm changelogs for 21.0.0 * MINOR: [Release] Fix issue in post-14-vcpkg.sh causing x-add-version to fail (#47156) ### Rationale for this change I ran into this running post-14-vcpkg.sh for the 21.0.0 release in https://github.com/microsoft/vcpkg/pull/46477#pullrequestreview-3036678763. If "port-version" exists in vcpkg.json, x-add-version fails with, ``` warning: In arrow, 21.0.0 is a completely new version, so there should be no "port-version". Remove "port-version" and try again. To skip this check, rerun with --skip-version-format-check . ``` This looks like a warning but it's actually a hard error and will cause your upstream PR to bounce. ### What changes are included in this PR? The script now removes the "port-version" field by default. I think the reason this worked sometimes and not others was because the field is supposed to be absent when 0 and it's usually 0 so our scripts don't need to update it. ### Are these changes tested? Yes. Locally. ### Are there any user-facing changes? No. Authored-by: Bryce Mecum <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]> * MINOR: [Release] Make post-11-bump-versions.sh work on macOS (#47158) ### Rationale for this change The script doesn't run out of the box on macOS. since `nproc` is not available. ### What changes are included in this PR? 
Makes the determination of the number of jobs dynamic and platform-specific. ### Are these changes tested? On macOS, yes. ### Are there any user-facing changes? No. Authored-by: Bryce Mecum <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]> * GH-47125: [CI][Dev] Fix shellcheck errors in the ci/scripts/integration_hdfs.sh (#47126) ### Rationale for this change This is the sub issue #44748. * SC2034: source_dir appears unused. Verify use (or export if used externally). * SC2086: Double quote to prevent globbing and word splitting. * SC2155: Declare and assign separately to avoid masking return values. ``` shellcheck ci/scripts/integration_hdfs.sh In ci/scripts/integration_hdfs.sh line 22: source_dir=${1}/cpp ^--------^ SC2034 (warning): source_dir appears unused. Verify use (or export if used externally). In ci/scripts/integration_hdfs.sh line 25: export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob) ^-------^ SC2155 (warning): Declare and assign separately to avoid masking return values. ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: export CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath --glob) In ci/scripts/integration_hdfs.sh line 45: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" For more information: https://www.shellcheck.net/wiki/SC2034 -- source_dir appears unused. Verify... https://www.shellcheck.net/wiki/SC2155 -- Declare and assign separately to ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * SC2034: disable shellcheck * SC2086: Quote variables. * SC2155: separate variable declaration and export. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #47125 Authored-by: Hiroyuki Sato <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]> * GH-47131: [C#] Fix day off by 1 in Date64Array (#47132) ### Rationale for this change `Date64Array.Convert(DateTimeOffset)` substracts one day on date times that are at 00:00 am and < 1970. For exampl…
Rationale for this change
Arrow deals with secrets such as encryption and decryption keys, which must be kept private. One way such secrets can leak is through memory reuse: when memory that previously held a secret is freed without being cleared first, a subsequent allocation (possibly in another process) can observe the leftover bytes, as sketched below.
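A minimal sketch of the failure mode, not Arrow code: clearing a buffer with a plain `memset` before freeing it is a dead store that the optimizer is allowed to remove, so the secret can survive in freed memory.

```cpp
#include <cstdlib>
#include <cstring>

// Sketch of the leak described above. The memset is a dead store (buf is
// never read again), so the compiler may drop it entirely, leaving the key
// bytes behind in freed heap memory for a later allocation to observe.
void UseKeyThenLeak(const char* key, std::size_t n) {
  char* buf = static_cast<char*>(std::malloc(n));
  if (buf == nullptr) return;
  std::memcpy(buf, key, n);
  // ... use the key material ...
  std::memset(buf, 0, n);  // may be optimized away: buf is never read again
  std::free(buf);          // the secret can still be resident on the heap
}
```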
What changes are included in this PR?
Uses various implementations of securely clearing memory, depending on the platform (a sketch of the dispatch follows the list):

- `SecureZeroMemory` (Windows)
- `memset_s` (STDC)
- `OPENSSL_cleanse` (OpenSSL >= 3)
- `explicit_bzero` (glibc 2.25+)
- `memset` (fallback)
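A minimal sketch of how such a platform dispatch might look, assuming the listed feature-test macros; the actual implementation in `arrow/util/secure_string.cc` may differ in detail (in particular, OpenSSL detection is omitted here):

```cpp
#include <cstddef>
#if defined(_WIN32)
#include <windows.h>
#else
#define __STDC_WANT_LIB_EXT1__ 1  // opt in to memset_s where provided
#include <string.h>
#endif

// Wipes `size` bytes at `data` in a way the optimizer must not remove.
void SecureClear(void* data, std::size_t size) {
#if defined(_WIN32)
  SecureZeroMemory(data, size);   // documented as never optimized away
#elif defined(__STDC_LIB_EXT1__)
  memset_s(data, size, 0, size);  // C11 Annex K; the stores must happen
#elif defined(__GLIBC__) && (__GLIBC__ > 2 || __GLIBC_MINOR__ >= 25)
  explicit_bzero(data, size);     // available since glibc 2.25
#else
  // Fallback: volatile stores cannot be dropped as dead code the way a
  // plain memset can. (The real code also dispatches to OPENSSL_cleanse
  // when built against OpenSSL >= 3.)
  volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
  for (std::size_t i = 0; i < size; ++i) p[i] = 0;
#endif
}
```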
Are these changes tested?
Unit tests.
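A hypothetical test sketch in GTest style (the suite name and the `as_view()` accessor are assumptions, not necessarily the real API): a key property worth testing is that moving a `std::string` into a `SecureString` leaves no readable copy of the secret in the source.

```cpp
#include <string>
#include <utility>

#include <gtest/gtest.h>

#include "arrow/util/secure_string.h"

// Hypothetical sketch: moving a std::string into SecureString must leave
// no readable copy of the secret behind in the moved-from string.
TEST(SecureStringSketch, MoveFromStdStringWipesSource) {
  std::string secret = "sensitive-key-material";
  arrow::util::SecureString key(std::move(secret));
  EXPECT_TRUE(secret.empty());  // source was cleared, not merely moved from
  EXPECT_EQ(key.as_view(), "sensitive-key-material");
}
```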
Are there any user-facing changes?
This only adds the `SecureString` class and tests. Using this new infrastructure is done in follow-up pull requests. A brief usage sketch follows.
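A usage sketch, assuming `SecureString` is constructible by moving from a `std::string` and wipes its buffer on destruction; `ReadKeyFromConfig` is a hypothetical helper, not part of Arrow:

```cpp
#include <string>
#include <utility>

#include "arrow/util/secure_string.h"

std::string ReadKeyFromConfig();  // hypothetical helper, not part of Arrow

// Usage sketch: hold key material in a SecureString so it is wiped when
// it goes out of scope, rather than lingering in freed memory.
void HoldKeySecurely() {
  std::string raw = ReadKeyFromConfig();
  arrow::util::SecureString key(std::move(raw));
  // `raw` no longer holds the key material; when `key` goes out of scope,
  // its buffer is securely cleared before the memory is released.
}
```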