
@pitrou (Member) commented Apr 14, 2022

No description provided.

@pitrou changed the title from "EXP: SecureZero helper to securely wipe memory" to "EXP: SecureZero helper to securely clear memory" Apr 14, 2022
@pitrou pitrou marked this pull request as draft April 14, 2022 16:32
@pitrou (Member, Author) commented Apr 14, 2022

We will also need a higher-level SecureString construct that doesn't leave data around when it is copied.
cc @bkietz for opinions
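
For reference, a minimal sketch of the idea behind such a helper, not the actual Arrow implementation: route the stores through a `volatile` pointer (platforms may instead provide `memset_s` or Windows' `SecureZeroMemory`) so the compiler cannot prove the writes are dead and elide them.

```cpp
#include <cstddef>

// Hypothetical helper (name and signature are illustrative only):
// overwrite a buffer with zeros through a volatile pointer so the
// optimizer must emit every store.
void SecureZero(void* data, std::size_t size) {
  volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
  while (size--) {
    *p++ = 0;
  }
}
```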

@pitrou (Member, Author) commented Apr 14, 2022

@github-actions crossbow submit -g cpp

@github-actions commented

Revision: 200c1ca

Submitted crossbow builds: ursacomputing/crossbow @ actions-1858

Task | Status
--- | ---
test-build-cpp-fuzz | GitHub Actions
test-conda-cpp | GitHub Actions
test-conda-cpp-valgrind | Azure
test-debian-10-cpp-amd64 | GitHub Actions
test-debian-10-cpp-i386 | GitHub Actions
test-debian-11-cpp-amd64 | GitHub Actions
test-debian-11-cpp-i386 | GitHub Actions
test-fedora-35-cpp | GitHub Actions
test-ubuntu-18.04-cpp | GitHub Actions
test-ubuntu-18.04-cpp-release | GitHub Actions
test-ubuntu-18.04-cpp-static | GitHub Actions
test-ubuntu-20.04-cpp | GitHub Actions
test-ubuntu-20.04-cpp-14 | GitHub Actions
test-ubuntu-20.04-cpp-17 | GitHub Actions
test-ubuntu-20.04-cpp-bundled | GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer | GitHub Actions
test-ubuntu-22.04-cpp | GitHub Actions

@pitrou (Member, Author) commented Apr 14, 2022

@github-actions crossbow submit -g cpp

@github-actions commented

Revision: 99f2375

Submitted crossbow builds: ursacomputing/crossbow @ actions-1859

Task | Status
--- | ---
test-build-cpp-fuzz | GitHub Actions
test-conda-cpp | GitHub Actions
test-conda-cpp-valgrind | Azure
test-debian-10-cpp-amd64 | GitHub Actions
test-debian-10-cpp-i386 | GitHub Actions
test-debian-11-cpp-amd64 | GitHub Actions
test-debian-11-cpp-i386 | GitHub Actions
test-fedora-35-cpp | GitHub Actions
test-ubuntu-18.04-cpp | GitHub Actions
test-ubuntu-18.04-cpp-release | GitHub Actions
test-ubuntu-18.04-cpp-static | GitHub Actions
test-ubuntu-20.04-cpp | GitHub Actions
test-ubuntu-20.04-cpp-14 | GitHub Actions
test-ubuntu-20.04-cpp-17 | GitHub Actions
test-ubuntu-20.04-cpp-bundled | GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer | GitHub Actions
test-ubuntu-22.04-cpp | GitHub Actions

@EnricoMi (Contributor) commented

@pitrou I'd be happy to pick this up and draft an Arrow SecureString implementation.

@pitrou (Member, Author) commented Mar 29, 2025

Feel free to pick this up, @EnricoMi. That said, we could also be less ambitious and start by exposing internal functions for secure erasure.

@EnricoMi (Contributor) commented Apr 3, 2025

Here is my draft: #46017

@pitrou (Member, Author) commented Jun 9, 2025

This is superseded by PR #46626, closing.

@pitrou pitrou closed this Jun 9, 2025
@pitrou pitrou deleted the secure-zero branch June 9, 2025 09:33
pitrou added a commit that referenced this pull request Jul 15, 2025

GH-31603: [C++] Wrap Parquet encryption keys in SecureString (#46017)
### Rationale for this change
Cryptographic keys must be kept private. Using the new `arrow::util::SecureString` is vital for storing secrets securely.

### What changes are included in this PR?
Uses the `arrow::util::SecureString` introduced in #46626 for cryptographic keys throughout Parquet encryption.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?
APIs that hand over secrets to Arrow require the secret to be encapsulated in a `SecureString`.

**This PR includes breaking changes to public APIs.**

TODO:
- provide instructions for migration
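
Until those instructions land, a rough sketch of what the migration might look like, assuming `SecureString` can be move-constructed from a `std::string` as introduced in #46626 (the header path and the `LoadKeyFromKms` helper are assumptions for illustration):

```cpp
#include <string>
#include <utility>

#include "arrow/util/secure_string.h"  // header path assumed

// Hypothetical stand-in for wherever the raw key material comes from.
std::string LoadKeyFromKms() { return "0123456789abcdef"; }

arrow::util::SecureString WrapKey() {
  std::string raw_key = LoadKeyFromKms();
  // Moving the std::string into SecureString is expected to wipe the
  // source buffer, so no plaintext copy of the key lingers behind. The
  // SecureString is then handed to the Parquet encryption builders in
  // place of a plain std::string.
  return arrow::util::SecureString(std::move(raw_key));
}
```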

Supersedes #12890.

* GitHub Issue: #31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
sofia-tekdatum added a commit to protegrity/arrow that referenced this pull request Nov 19, 2025
* GH-46745: [C++] Update bundled Boost to 1.88.0 and Apache Thrift to 0.22.0 (#46912)

### Rationale for this change

Bundled Boost 1.81.0 and Apache Thrift are old.

It's difficult to upgrade only Boost because Apache Thrift depends on Boost, so this PR updates both bundled Boost and bundled Apache Thrift.

### What changes are included in this PR?

* Update bundled Boost:
  * Use a CMake-based build instead of b2
  * Use FetchContent instead of ExternalProject
  * Stop using our trimmed Boost source archive
* Update bundled Apache Thrift:
  * Use FetchContent instead of ExternalProject

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #46745
* GitHub Issue: #46740

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-46962: [C++][Parquet] Generic xsimd function and dynamic dispatch for Byte Stream Split (#46963)

### Rationale for this change
Many Linux systems ship Arrow built with SSE4.2, even though AVX2 instructions are widely available.
For Byte Stream Split, AVX2 is faster than SSE4.2.

### What changes are included in this PR?
- Make the xsimd functions refactored in #46789 architecture-independent.
- Use dynamic dispatch to AVX2 at runtime if available (builds without SSE4.2 or Neon at compile time were considered too uncommon to include in the dynamic dispatch); a schematic of the pattern is sketched below.
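
A schematic of the dispatch pattern, not the actual Arrow code (kernel names and the feature probe are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

using DecodeFn = void (*)(const uint8_t* in, uint8_t* out, std::size_t n);

// Stand-ins for kernels that would live in separate translation units
// compiled with different target flags (-msse4.2 / -mavx2).
static void DecodeSse42(const uint8_t*, uint8_t*, std::size_t) { /* SSE4.2 path */ }
static void DecodeAvx2(const uint8_t*, uint8_t*, std::size_t) { /* AVX2 path */ }

// Hypothetical runtime probe; Arrow uses its own CPU-info utility.
static bool CpuSupportsAvx2() { return false; }

// Resolve the best kernel once; later calls pay only an indirect call.
DecodeFn GetByteStreamSplitDecode() {
  static const DecodeFn fn = CpuSupportsAvx2() ? DecodeAvx2 : DecodeSse42;
  return fn;
}
```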

### Are these changes tested?
Yes, the existing tests already cover the code.

### Are there any user-facing changes?
No

* GitHub Issue: #46962

Lead-authored-by: AntoinePrv <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE (#44739)

### Rationale for this change

We are trying to store binary data (in our case, dumps of captured CAN messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) and is not a UTF-8 string (or a text string at all). For this, the physical type BYTE_ARRAY with logical type NONE seems appropriate.

Unfortunately, the Parquet stream writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.

### What changes are included in this PR?

Allow the Parquet stream writer to store data in a BYTE_ARRAY with the NONE logical type. The changes are based on similar changes made earlier to the stream reader.

The reader part has already been fixed in 4d825497cb04c9e1c288000a7a8f75786cc487ff, and this uses a similar implementation, but with a stricter set of "exceptions" (only BYTE_ARRAY with the NONE type is allowed).
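
For illustration, a sketch of the kind of schema node the stream writer now accepts, built with the `parquet::schema` API:

```cpp
#include "parquet/schema.h"

using parquet::ConvertedType;
using parquet::Repetition;
using parquet::Type;
using parquet::schema::PrimitiveNode;

// A variable-length binary column with converted type NONE, e.g. for raw
// CAN-frame payloads of 0 to 8 bytes.
auto payload_node = PrimitiveNode::Make(
    "payload", Repetition::OPTIONAL, Type::BYTE_ARRAY, ConvertedType::NONE);
```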

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Only a new feature.

* GitHub Issue: #42971

Authored-by: Adrien Destugues <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-47012: [C++][Parquet] Reserve values correctly when reading BYTE_ARRAY and FLBA (#47013)

### Rationale for this change

When reading a Parquet leaf column as Arrow, we [presize the Arrow builder](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/arrow/reader.cc#L487-L488) so as to avoid spurious reallocations during incremental Parquet decoding calls.

However, the Reserve method on RecordReader will [only properly reserve values](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/column_reader.cc#L1693-L1696) for non-FLBA non-BYTE_ARRAY physical types.

The result is that, on some of our micro-benchmarks, we spend a significant amount of time reallocating data on the ArrayBuilder. 

### What changes are included in this PR?

Properly reserve space on Array builders when reading Parquet data as Arrow. Note that, when reading into Binary or LargeBinary, this doesn't avoid reallocations for the actual data. However, for FixedSizeBinary and BinaryView, this is sufficient to avoid any reallocations.
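
As a sketch of the effect on the builder side (the column shape and sizes are arbitrary): reserving capacity up front means the subsequent appends never reallocate, which is what the fix now arranges for FLBA and BYTE_ARRAY reads:

```cpp
#include <memory>

#include "arrow/api.h"

arrow::Status BuildFixedSizeBinary(int64_t num_values) {
  arrow::FixedSizeBinaryBuilder builder(arrow::fixed_size_binary(16));
  // The reader now performs the equivalent of this Reserve before
  // decoding, so the Appends below never trigger a reallocation.
  ARROW_RETURN_NOT_OK(builder.Reserve(num_values));
  for (int64_t i = 0; i < num_values; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append("0123456789abcdef"));  // 16 bytes
  }
  std::shared_ptr<arrow::Array> out;
  return builder.Finish(&out);
}
```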

Benchmark numbers on my local machine (Ubuntu 24.04):
```
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (250)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                 benchmark         baseline        contender  change %                                                                                                                                                                                                                                     counters
                          BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1    3.295 GiB/sec    7.834 GiB/sec   137.771                               {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 118}
                BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1    3.453 GiB/sec    8.148 GiB/sec   135.957                     {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 119}
                BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.870                      {'family_index': 13, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                          BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.861                                {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0    1.292 GiB/sec    1.662 GiB/sec    28.666                        {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0    1.304 GiB/sec    1.665 GiB/sec    27.691                                  {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:32  959.085 MiB/sec    1.185 GiB/sec    26.568                                     {'family_index': 15, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99    1.012 GiB/sec    1.210 GiB/sec    19.557                       {'family_index': 13, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1    1.011 GiB/sec    1.187 GiB/sec    17.407                       {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99    1.024 GiB/sec    1.201 GiB/sec    17.206                                 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1    1.023 GiB/sec    1.197 GiB/sec    17.016                                     {'family_index': 15, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:32  541.347 MiB/sec  632.640 MiB/sec    16.864                                         {'family_index': 14, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1  954.762 MiB/sec    1.084 GiB/sec    16.272                                  {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1  970.997 MiB/sec    1.100 GiB/sec    15.969                        {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 34}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:-1  592.605 MiB/sec  666.605 MiB/sec    12.487                                        {'family_index': 14, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1  587.604 MiB/sec  659.154 MiB/sec    12.177                          {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1  867.001 MiB/sec  962.427 MiB/sec    11.006                                     {'family_index': 15, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50  473.040 MiB/sec  522.948 MiB/sec    10.551                                 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1    1.633 GiB/sec    1.800 GiB/sec    10.197                                      {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                              BM_ReadStructOfListColumn/50  466.944 MiB/sec  513.407 MiB/sec     9.951                                                                    {'family_index': 20, 'per_family_instance_index': 2, 'run_name': 'BM_ReadStructOfListColumn/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1  894.649 MiB/sec  976.595 MiB/sec     9.160                       {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50  479.717 MiB/sec  523.293 MiB/sec     9.084                       {'family_index': 13, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:-1  613.860 MiB/sec  667.963 MiB/sec     8.814                                         {'family_index': 14, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1    1.479 GiB/sec    1.608 GiB/sec     8.761                        {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1    1.628 GiB/sec    1.762 GiB/sec     8.235                        {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                               BM_ReadStructOfListColumn/0  760.221 MiB/sec  822.339 MiB/sec     8.171                                                                     {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ReadStructOfListColumn/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:32  843.826 MiB/sec  912.397 MiB/sec     8.126                                      {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:32  699.538 MiB/sec  755.468 MiB/sec     7.995                                     {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                            BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024    3.724 GiB/sec    4.007 GiB/sec     7.597                                               {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 176027}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1    1.474 GiB/sec    1.586 GiB/sec     7.591                                      {'family_index': 15, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:-1    1.114 GiB/sec    1.192 GiB/sec     7.005                                          {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:1/unique_values:-1    1.022 GiB/sec    1.091 GiB/sec     6.715                                          {'family_index': 14, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                     BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1    1.101 GiB/sec    1.174 GiB/sec     6.557                            {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000   18.019 MiB/sec   19.100 MiB/sec     5.997    {'family_index': 33, 'per_family_instance_index': 14, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6295}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:32  893.151 MiB/sec  945.900 MiB/sec     5.906                                      {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.243 MiB/sec   21.404 MiB/sec     5.733    {'family_index': 33, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7257}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1  620.583 MiB/sec  655.859 MiB/sec     5.684                           {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:32  751.375 MiB/sec  793.728 MiB/sec     5.637                                          {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:32  537.693 MiB/sec  567.159 MiB/sec     5.480                                         {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
  BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100   44.112 MiB/sec   46.474 MiB/sec     5.355     {'family_index': 33, 'per_family_instance_index': 6, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15273}
   BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.750 MiB/sec   21.843 MiB/sec     5.265      {'family_index': 30, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7387}
                                                      BM_ReadColumn<false,Int32Type>/-1/10    7.621 GiB/sec    8.019 GiB/sec     5.223                                                            {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumn<false,Int32Type>/-1/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 137}

[ ... snip non-significant changes ... ]

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (4)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                           benchmark        baseline       contender  change %                                                                                                                                                                                             counters
                                BM_ReadListColumn/99   1.452 GiB/sec   1.379 GiB/sec    -5.006                                   {'family_index': 21, 'per_family_instance_index': 3, 'run_name': 'BM_ReadListColumn/99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 129}
BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024 270.542 MiB/sec 256.345 MiB/sec    -5.248 {'family_index': 27, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32060}
          BM_ArrowBinaryPlain/DecodeArrow_Dict/65536 172.371 MiB/sec 162.455 MiB/sec    -5.753             {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 319}
    BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024 189.008 MiB/sec 176.900 MiB/sec    -6.406     {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22292}
```

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #47012

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-47037: [CI][C++] Fix Fedora 39 CI jobs (#47038)

### Rationale for this change

The system package for xsimd is too old on Fedora 39, use bundled version instead.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #47037

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-45639: [C++][Statistics] Add support for ARROW:average_byte_width:{exact,approximate} (#46385)

### Rationale for this change

`ARROW:average_byte_width:exact` and `ARROW:average_byte_width:approximate` statistics attributes are missing in `arrow::ArrayStatistics`.

### What changes are included in this PR?

Add `average_byte_width` and `is_average_byte_width_exact` member variables to `arrow::ArrayStatistics`.
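
A small sketch of how these members might be populated and read, assuming they follow the optional-valued style of the existing `arrow::ArrayStatistics` fields:

```cpp
#include <iostream>

#include "arrow/array/statistics.h"  // header path assumed

int main() {
  arrow::ArrayStatistics stats;
  stats.average_byte_width = 4.0;            // new member (this PR)
  stats.is_average_byte_width_exact = true;  // new member (this PR)
  if (stats.average_byte_width.has_value()) {
    std::cout << "average byte width: " << *stats.average_byte_width
              << (stats.is_average_byte_width_exact ? " (exact)" : " (approximate)")
              << std::endl;
  }
  return 0;
}
```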

### Are these changes tested?

Yes, I ran the relevant unit tests.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #45639

Lead-authored-by: Arash Andishgar <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-46938: [C++] Enhance arrow::ChunkedArray::Equals to support floating-point comparison when values share the same memory (#47044)

### Rationale for this change

As discussed [here](https://github.com/apache/arrow/issues/46938#issue-3187249840), this is a minor enhancement to `arrow::ChunkedArray::Equals`.

### What changes are included in this PR?

A minor improvement to `arrow::ChunkedArray::Equals` to handle the case where chunked arrays share the same underlying memory.
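
To make the edge case concrete, a sketch: two chunked arrays wrapping the same underlying array contain NaN, and they now compare equal because the chunks share memory (previously NaN != NaN made this false despite identical buffers):

```cpp
#include <cmath>
#include <memory>

#include "arrow/api.h"

arrow::Result<bool> SharedMemoryEquals() {
  arrow::DoubleBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1.0, std::nan(""), 3.0}));
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  // Both chunked arrays wrap the *same* array, hence the same buffers.
  auto lhs = std::make_shared<arrow::ChunkedArray>(values);
  auto rhs = std::make_shared<arrow::ChunkedArray>(values);
  return lhs->Equals(rhs);  // true with this change
}
```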

### Are these changes tested?

Yes, I ran the relevant unit tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #46938

Authored-by: Arash Andishgar <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-46989: [CI][R] Use Ubuntu 20.04 instead of OpenSUSE for R 4.1 (#46990)

### Rationale for this change

OpenSUSE 15.5 ships an old GCC (7.5) that lacks sufficient C++17 support.

### What changes are included in this PR?

Use Ubuntu 20.04, which ships GCC 9.3, instead of OpenSUSE 15.5.

Ubuntu 20.04 reached EOL but we can use it for now.

We discussed why we need OpenSUSE 15.5 based job at https://github.com/apache/arrow/issues/45718#issuecomment-2743538384 . We have the job because https://arrow.apache.org/docs/developers/cpp/building.html said "gcc 7.1 and higher should be sufficient".

We will require GCC 9 or later with #46813.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #46989

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47033: [C++][Compute] Never use custom gtest main with MSVC (#47049)

### Rationale for this change

If we use the custom gtest main with MSVC, it always reports "SEH exception".

### What changes are included in this PR?

Remove MSVC version check.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #47033

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-38211: [MATLAB] Add support for creating an empty `arrow.tabular.RecordBatch` by calling `arrow.recordBatch` with no input arguments (#47060)

### Rationale for this change

Currently, the `arrow.table` construction function returns an empty `arrow.tabular.Table` if no input arguments are passed to the function. However, `arrow.recordBatch` throws an error in this case. We should consider making `arrow.recordBatch` behave consistently with `arrow.table`.

This should be relatively straightforward to implement. We can just set the input argument `T` to default to `table.empty(0,0)` in the `arguments` block of the `recordBatch` function, in the same way that `arrow.table` does:

https://github.com/apache/arrow/blob/73454b7040fbea3a187c1bfabd7ea02d46ca3c41/matlab/src/matlab/%2Barrow/table.m#L21

### What changes are included in this PR?

Updated the `arrow.recordBatch` function to return an `arrow.tabular.RecordBatch` instance with zero columns and zero rows if called with zero input arguments. Before this change, the `arrow.recordBatch` function would throw an error if called with zero input arguments.

**Example Usage:**
```matlab
>> rb = arrow.recordBatch()

rb = 

  Arrow RecordBatch with 0 rows and 0 columns
```

### Are these changes tested?

Yes. Added a new test case to `tRecordBatch` called `ConvenienceConstructorZeroArguments`.

### Are there any user-facing changes?

Yes. Users can now call `arrow.recordBatch` with zero input arguments.

* GitHub Issue: #38211

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>

* GH-47061: [Release] Fix wrong variable name for signing (#47062)

### Rationale for this change

We must use the GPG key ID, not the GPG key itself, for `gpg --local-user`.

### What changes are included in this PR?

Use `ARROW_GPG_KEY_UID`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47061

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47063: [Release] Define missing RELEASE_TARBALL (#47064)

### Rationale for this change

`RELEASE_TARBALL` is registered to `GITHUB_ENV` but isn't defined in this context.

### What changes are included in this PR?

Define `RELEASE_TARBALL`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47063

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47047: [CI][C++] Use Google Cloud Storage Testbench v0.55.0 (#47048)

### Rationale for this change

v0.55.0 is the latest version. v0.39.0 depends on an old grpcio (1.59.0) that doesn't provide wheels for Python 3.13.

### What changes are included in this PR?

Update the default Google Cloud Storage Testbench version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47047

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47065: [Release] Fix timeout key in verify_rc.yml (#47066)

### Rationale for this change

We must use `timeout-minutes`, not `timeout`, for timeouts.

### What changes are included in this PR?

Fix key.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47065

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47067: [Release] Fix wrong GitHub Actions context in verify_rc.yml (#47068)

### Rationale for this change

We must use `inputs`, not `input`, for workflow dispatch inputs: https://docs.github.com/en/actions/reference/contexts-reference#inputs-context

### What changes are included in this PR?

Fix the context name.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47067

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47069: [Release] Add missing "needs: target" (#47070)

### Rationale for this change

We need `needs: target` for jobs that use the `target` job outputs.

### What changes are included in this PR?

Add missing `needs: target`s.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47069

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47071: [Release] Dereference all hard links in source archive (#47072)

### Rationale for this change

Apache Rat doesn't like hard links.

### What changes are included in this PR?

Use `tar --hard-dereference`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47071

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47074: [Release] Use reproducible mtime for csharp/ in source archive (#47076)

### Rationale for this change

The current source archive creation is reproducible when we use the same Git working tree.

But it's not reproducible when we use different Git working trees.

### What changes are included in this PR?

Use the committer date of the target commit instead of the `csharp/` mtime in the current Git working tree for `csharp/` in the source archive.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47074

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47078: [Release] Ensure using cloned apache/arrow for reproducible check (#47079)

### Rationale for this change

We need to use the `dev/release/utils-create-release-tarball.sh` that exists in the target apache/arrow directory.

### What changes are included in this PR?

Use `dev/release/utils-create-release-tarball.sh` from the cloned apache/arrow.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47078

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47045: [CI][C++] Use Fedora 42 instead of 39 (#47046)

### Rationale for this change

Fedora 39 reached EOL on 2024-11-26: https://docs.fedoraproject.org/en-US/releases/eol/

### What changes are included in this PR?

Use Fedora 42, which is the latest release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47045

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47005: [C++] Disable exporting CMake packages (#47006)

### Rationale for this change

azure-sdk-for-cpp uses `export(PACKAGE)` (https://cmake.org/cmake/help/latest/command/export.html#package), which changes the user package registry (`~/.cmake/packages/`, https://cmake.org/cmake/help/latest/manual/cmake-packages.7.html#user-package-registry). The user package registry lives outside the build directory, so changing it may break other builds.

### What changes are included in this PR?

Disable `export(PACKAGE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47005

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47081: [Release] Verify reproducible source build explicitly (#47082)

### Rationale for this change

There are two problems with verification of the reproducible source archive:

1. CI on macOS isn't prepared correctly
2. Some verification environments may not have the required tools

FYI: we need the following to check the reproducible build on macOS:

* Ensure apache/arrow is used for `GITHUB_REPOSITORY`
  * `GITHUB_REPOSITORY` is defined automatically on GitHub Actions. Our Crossbow-based verification job has `GITHUB_REPOSITORY=ursacomputing/crossbow` by default.
* GNU tar
* GNU gzip

### What changes are included in this PR?

For problem 1:
* Set `GITHUB_REPOSITORY` explicitly
* Install GNU gzip (GNU tar is already installed)

For problem 2:
* Add `TEST_SOURCE_REPRODUCIBLE` that is `0` by default
* Set `TEST_SOURCE_REPRODUCIBLE=1` on CI
* At least one PMC member must set `TEST_SOURCE_REPRODUCIBLE=1` on release verification

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47081

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47092: [Release] Fix errors in APT/Yum previous version verification (#47093)

### Rationale for this change

There are some problems in the APT/Yum previous version verification:

* There are some typos
* Can't reuse `dev/release/verify-release-candidate.sh` for the previous version verification 

### What changes are included in this PR?

* Fix typos
* Reuse `dev/release/verify-release-candidate.sh` for the previous version verification
* Ignore the previous version verification result for now
  * We may revisit this once we can fix the current problems. See the added comments for details.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47092

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-41110: [C#] Handle empty stream in ArrowStreamReaderImplementation (#47098)

### Rationale for this change
Fixing #41110.

### What changes are included in this PR?
Handle empty stream in `ArrowStreamReaderImplementation`. Similar changes have *not* been made to `ArrowMemoryReaderImplementation` or `ArrowFileReaderImplementation`.

### Are these changes tested?
Two basic unit tests have been created to validate the new behavior. This might not be sufficient to cover all cases where an empty stream should be handled without an exception occurring.

Original change by @voidstar69; this takes his change and applies the PR feedback to it.

* GitHub Issue: #41110

Lead-authored-by: voidstar69 <[email protected]>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>

* MINOR: [C#] Bump BenchmarkDotNet and 6 others (#46828)

Performed the following updates:
- Updated BenchmarkDotNet from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated BenchmarkDotNet.Diagnostics.Windows from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.AspNetCore from 2.67.0 to 2.71.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.Tools from 2.71.0 to 2.72.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.0 in /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.


---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

</details>

Lead-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>

* GH-38532: [MATLAB] Add a `validate` method to all `arrow.array.Array` classes (#47059)


### Rationale for this change

As a follow up to #38531 (see https://github.com/apache/arrow/pull/38531#discussion_r1377981403), we should consider adding a `validate` method to all `arrow.array.Array` classes, which would allow users to explicitly validate the contents of an `arrow.array.Array` after it is created.

### What changes are included in this PR?

Added `validate()` as a method to `arrow.array.Array`. This method accepts one name-value pair, `ValidationMode`, which can be specified as either `"minimal"` or `"full"`. By default, `ValidationMode="minimal"`.

**Example Usage:**

```matlab
>> offsets = arrow.array(int32([0 1 0]));
>> values = arrow.array(1:3);
>> array = arrow.array.ListArray.fromArrays(offsets, values);
>> array.validate(ValidationMode="full")
Error using .  (line 63)
Offset invariant failure: non-monotonic offset at slot 2: 0 < 1

Error in arrow.array.Array/validate (line 68)
             obj.Proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

```

### Are these changes tested?

Yes. Added a MATLAB test class called `tValidateArray.m`.

### Are there any user-facing changes?

Yes. There is a new public method that is accessible via any subclass of `arrow.array.Array`. 

* GitHub Issue: #38532

Lead-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>

* GH-47027: [C++][Parquet] Fix repeated column pages not being written when reaching page size limit (#47032)

### Rationale for this change

Ensures Parquet pages are written when the buffered data reaches the configured page size, while also ensuring pages are only split on record boundaries when required.

Without this fix, page sizes can grow unbounded until the row group is closed.

### What changes are included in this PR?

Fixes an off-by-one error in the logic that controls when pages can be written.

### Are these changes tested?

Yes, added a new unit test.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**

This bug could cause a crash when writing a large number of rows of a repeated column and reaching a page size > max int32.
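
For context, the page size limit involved is the one configured on the writer properties; a sketch using the standard builder API:

```cpp
#include <memory>

#include "parquet/properties.h"

// Request ~1 MiB data pages. Before this fix, a repeated column could
// keep buffering far past this limit until the row group was closed.
std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.data_pagesize(1024 * 1024);
  return builder.build();
}
```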
* GitHub Issue: #47027

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Adam Reeve <[email protected]>

* GH-47088: [CI][Dev] Fix shellcheck errors in the ci/scripts/integration_arrow.sh (#47089)

### Rationale for this change

This is a sub-issue of #44748.

* SC2046: Quote this to prevent word splitting.
* SC2086: Double quote to prevent globbing and word splitting.
* SC2102: Ranges can only match single chars (mentioned due to duplicates).
* SC2223: This default assignment may cause DoS due to globbing. Quote it.

```
ci/scripts/integration_arrow.sh

In ci/scripts/integration_arrow.sh line 27:
: ${ARROW_INTEGRATION_CPP:=ON}
  ^--------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 28:
: ${ARROW_INTEGRATION_CSHARP:=ON}
  ^-----------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 30:
: ${ARCHERY_INTEGRATION_TARGET_IMPLEMENTATIONS:=cpp,csharp}
  ^-- SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 33:
. ${arrow_dir}/ci/scripts/util_log.sh
  ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
. "${arrow_dir}"/ci/scripts/util_log.sh

In ci/scripts/integration_arrow.sh line 36:
pip install -e $arrow_dir/dev/archery[integration]
               ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-----------^ SC2102 (info): Ranges can only match single chars (mentioned due to duplicates).

Did you mean:
pip install -e "$arrow_dir"/dev/archery[integration]

In ci/scripts/integration_arrow.sh line 66:
    --with-cpp=$([ "$ARROW_INTEGRATION_CPP" == "ON" ] && echo "1" || echo "0") \
               ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 67:
    --with-csharp=$([ "$ARROW_INTEGRATION_CSHARP" == "ON" ] && echo "1" || echo "0") \
                  ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 68:
    --gold-dirs=$gold_dir/0.14.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.14.1 \

In ci/scripts/integration_arrow.sh line 69:
    --gold-dirs=$gold_dir/0.17.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.17.1 \

In ci/scripts/integration_arrow.sh line 70:
    --gold-dirs=$gold_dir/1.0.0-bigendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-bigendian \

In ci/scripts/integration_arrow.sh line 71:
    --gold-dirs=$gold_dir/1.0.0-littleendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-littleendian \

In ci/scripts/integration_arrow.sh line 72:
    --gold-dirs=$gold_dir/2.0.0-compression \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/2.0.0-compression \

In ci/scripts/integration_arrow.sh line 73:
    --gold-dirs=$gold_dir/4.0.0-shareddict \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/4.0.0-shareddict \

For more information:
  https://www.shellcheck.net/wiki/SC2046 -- Quote this to prevent word splitt...
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
  https://www.shellcheck.net/wiki/SC2102 -- Ranges can only match single char...

```

### What changes are included in this PR?

Quote variables.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47088

Authored-by: Hiroyuki Sato <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-46629: [Python] Add options to DatasetFactory.inspect (#46961)

### Rationale for this change

See https://github.com/apache/arrow/issues/46629.

### What changes are included in this PR?

This PR updates the `DatasetFactory.inspect` method so that it accepts new `promote_options` and `fragments` parameters. Since we parse a string into a `MergeOptions` struct in three different places, this PR defines the helper function `_parse_field_merge_options`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This adds optional parameters to a public method. It changes the default behavior from checking one fragment to checking all fragments (the old documentation said it inspected "all data fragments" even though it didn't).

* GitHub Issue: #46629

Lead-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-31603: [C++] Wrap Parquet encryption keys in SecureString (#46017)

### Rationale for this change
Cryptographic keys must be kept private. Using the new `arrow::util::SecureString` is vital for storing secrets securely.

### What changes are included in this PR?
Uses the `arrow::util::SecureString` introduced in #46626 for cryptographic keys throughout Parquet encryption.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?
APIs that hand over secrets to Arrow require the secret to be encapsulated in a `SecureString`.

**This PR includes breaking changes to public APIs.**

TODO:
- provide instructions for migration

Supersedes #12890.

* GitHub Issue: #31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-46728: [Python] Skip test_gdb.py tests if PyArrow wasn't built debug (#46755)

### Rationale for this change
As mentioned in #46728, if Arrow C++ was built in debug mode and PyArrow wasn't, test_gdb.py runs tests that fail.

### What changes are included in this PR?
The CMAKE_BUILD_TYPE environment variable is propagated from the build into PyArrow, where it's checked to skip unit tests.

### Are these changes tested?
Yes. I have built PyArrow in release, debug, and relwithdebinfo and observed the new behavior. Because CMakeLists.txt was changed, I built PyArrow twice via setup.py and pip install, and checked the new function.

### Are there any user-facing changes?
Developers may now skip unit tests that would otherwise fail. PyArrow's build_info() now includes information about the build type.

* GitHub Issue: #46728

Lead-authored-by: Eric Dinse <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-46860: [C++] Making HalfFloatBuilder accept Float16 as well as uint16_t (#46981)

### Rationale for this change

#46860: add convenience methods to HalfFloatBuilder for appending and retrieving Float16 values.

### What changes are included in this PR?

HalfFloatBuilder functions are overloaded to accept Float16; tests and documentation are included. A usage sketch follows.
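
A usage sketch, assuming the new overload mirrors the existing `uint16_t` one (the exact `Float16` construction API is an assumption):

```cpp
#include <memory>

#include "arrow/api.h"
#include "arrow/util/float16.h"

arrow::Status AppendHalfFloats() {
  arrow::HalfFloatBuilder builder;
  // Existing overload: raw IEEE 754 binary16 bits.
  ARROW_RETURN_NOT_OK(builder.Append(uint16_t{0x3C00}));  // 1.0
  // New overload (this PR): a typed Float16 value.
  ARROW_RETURN_NOT_OK(builder.Append(arrow::util::Float16::FromBits(0x4000)));  // 2.0
  std::shared_ptr<arrow::Array> out;
  return builder.Finish(&out);
}
```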

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #46860

Lead-authored-by: Eric Dinse <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-38572: [Docs][MATLAB] Update `arrow/matlab/README.md` with the latest change. (#47109)

### Rationale for this change

The documentation for the MATLAB interface hasn't been updated in a while and is out of date.

This PR includes several miscellaneous updates and enhancements to the documentation.

### What changes are included in this PR?

1. Updated the display output for the various `arrow.*` types used in the code snippets in [matlab/README.md](https://github.com/apache/arrow/blob/main/matlab/README.md).
2. Updated the table of supported `arrow.array.Array` types in [matlab/README.md](https://github.com/apache/arrow/blob/main/matlab/README.md).
3. Added a link to the [Testing Guidelines](https://github.com/apache/arrow/blob/main/matlab/doc/testing_guidelines_for_the_matlab_interface_to_apache_arrow.md) in the Testing section of [matlab/README.md](https://github.com/apache/arrow/blob/main/matlab/README.md)
4. Added a link to the [Testing Guidelines](https://github.com/apache/arrow/blob/main/matlab/doc/testing_guidelines_for_the_matlab_interface_to_apache_arrow.md) in [matlab_interface_for_apache_arrow.md](https://github.com/apache/arrow/blob/main/matlab/doc/matlab_interface_for_apache_arrow_design.md).

### Are these changes tested?

N/A. 

This is a documentation-only change.

I visually inspected the rendered Markdown for all modified documents to ensure the updated contents display as expected.

### Are there any user-facing changes?

Yes.

1. This PR updates the user-facing documentation for the MATLAB interface. [matlab/README.md](https://github.com/apache/arrow/blob/main/matlab/README.md) and [matlab/doc/matlab_interface_for_apache_arrow.md](https://github.com/apache/arrow/blob/main/matlab/doc/matlab_interface_for_apache_arrow_design.md) have been modified as described above in the "What changes are included in this PR?" section.

### Future Directions

1. Consider deleting or re-purposing `matlab/doc/matlab_interface_for_apache_arrow_design.md`. This document is fairly old and may no longer be the most effective way to document the MATLAB interface.
2. Consider using shorter filenames for `matlab_interface_for_apache_arrow_design.md` and/or `testing_guidelines_for_the_matlab_interface_to_apache_arrow.md`.
3. #28149 - consider moving code snippets in `README.md` to an `examples/` directory to simplify `README.md`.
4. Consider adding more developer-oriented documentation (e.g. like "`contributing.md`" to help guide and encourage new contributors to the MATLAB interface).
* GitHub Issue: #38572

Authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>

* GH-38213: [MATLAB] Create a superclass for tabular type MATLAB tests (i.e. for `Table` and `RecordBatch`) (#47107)

### Rationale for this change

Many of the tests for `Table` and `RecordBatch` are similar. To reduce code duplication and ensure consistency, we could consider adding a shared superclass for tabular type tests.

### What changes are included in this PR?

Refactored `tTable` and `tRecordBatch` to inherit from `hTabular`. `hTabular` is a new MATLAB test class that defines shared tests for `Table` and `RecordBatch`.

### Are these changes tested?

Yes (see the MATLAB GitHub Actions workflow runs).

### Are there any user-facing changes?

No.

* GitHub Issue: #38213

Lead-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>

* GH-38422: [MATLAB] Add `NumNulls` property to `arrow.array.Array` class (#47116)

### Rationale for this change

It would be nice if there were a `NumNulls` property on the `arrow.array.Array` base class. Currently, the only way to figure out the number of nulls is to count the number of `false` values in the `Valid` array:

```matlab
>> a = arrow.array([1 2 NaN 4 5 6 NaN 8 9 10 NaN]);
>> invalidValues = ~a.Valid;
>> numNulls = nnz(invalidValues)

numNulls =

     3
```

It would be nice if `NumNulls` was already a property on the array class. As @ kou mentioned, we can use `arrow::Array::null_count()` to get the number of nulls.

### What changes are included in this PR?

Added `NumNulls` as a property of the `arrow.array.Array` abstract class. `NumNulls` is a scalar `int64` value that returns the number of null elements in the array.

**Example Usage**
```matlab
>> a  = arrow.array([1 2 NaN 3 4 NaN 5 6 NaN])

a = 

  Float64Array with 9 elements and 3 null values:

    1 | 2 | null | ... | 5 | 6 | null

>> a.NumNulls 

ans =

  int64

   3
```

### Are these changes tested?

Yes. Added test cases verifying the `NumNulls` property to these MATLAB test classes: `hNumeric`, `tBooleanArray`, `tTimestampArray`, `tTime32Array`, `tTime64Array`, `tDate32Array`, `tDate64Array`, `tListArray`, `tStringArray`, and `tStructArray`.

### Are there any user-facing changes?

Yes.  Users can now use the `NumNulls` property to query the number of null elements in an array.

### Future Changes

1. Add `NumNulls` as a property of `arrow.array.ChunkedArray`.

* GitHub Issue: #38422

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>

* GH-46272: [C++] Build Arrow libraries with `-Wmissing-declarations` on gcc (#47042)

### Rationale for this change

The warning option `-Wmissing-declarations` allows finding private functions that erroneously have global linkage (because they are neither static nor in the anonymous namespace).

We only care about this for the public Arrow libraries, not for tests or utilities where it's harmless to have private functions that nevertheless have global linkage (and changing that would require a lot of pointless code churn).
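
For illustration, a minimal translation unit showing what gcc's `-Wmissing-declarations` flags; the helper names are made up:

```cpp
// Compile with: g++ -c -Wmissing-declarations example.cc

// Defined with external linkage and no prior declaration in a header:
// gcc warns "no previous declaration for 'int LeakyHelper(int)'".
int LeakyHelper(int x) { return x + 1; }

// Giving the helper internal linkage silences the warning:
static int StaticHelper(int x) { return x + 1; }

namespace {
int AnonymousHelper(int x) { return x + 1; }  // also internal linkage
}  // namespace

int main() { return StaticHelper(AnonymousHelper(LeakyHelper(0))); }
```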

### Are these changes tested?

Yes, on builds using gcc.

### Are there any user-facing changes?

No, because this is only enabled if the warning level is "CHECKIN". Release builds will by default use the "PRODUCTION" warning level.
* GitHub Issue: #46272

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-47040: [C++] Refine reset of Span to be reusable (#47004)

### Rationale for this change

The original reset() left the Span unusable afterwards, e.g.
```
Span span;
span.reset();
span.valid(); // crash
RewrapSpan(span, ..); // crash
```
Instead of resetting the pointer to the SpanImpl, we should reset the contents inside the SpanImpl.

### What changes are included in this PR?

Adds a reset() function to SpanImpl and reimplements Span::reset() in terms of it.
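
A generic sketch of the design choice, modeled on the description above rather than the actual Arrow implementation: keep the `SpanImpl` object alive and clear its contents instead of destroying it.

```cpp
#include <memory>
#include <optional>

struct SpanImpl {
  std::optional<int> state;        // stand-in for the wrapped telemetry span
  void reset() { state.reset(); }  // clear contents, keep the object alive
};

struct Span {
  std::unique_ptr<SpanImpl> impl = std::make_unique<SpanImpl>();
  // Before the fix: reset() destroyed the SpanImpl, so later calls crashed.
  // After the fix: delegate to SpanImpl::reset(), leaving impl non-null.
  void reset() { impl->reset(); }
  bool valid() const { return impl != nullptr && impl->state.has_value(); }
};
```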

### Are these changes tested?

No

### Are there any user-facing changes?

No

* GitHub Issue: #47040

Authored-by: ZENOTME <[email protected]>
Signed-off-by: David Li <[email protected]>

* GH-47120: [R] Update NEWS for 21.0.0 (#47121)

### Rationale for this change

NEWS file not updated

### What changes are included in this PR?

Update NEWS.md for the release

### Are these changes tested?

no

### Are there any user-facing changes?

nope
* GitHub Issue: #47120

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* MINOR: [Docs][R] Add link to book to R README (#47119)

### Rationale for this change

The list of resources for further help in the README didn't include a link to "Scaling Up With R and Arrow".

### What changes are included in this PR?

Add link to Scaling Up With R and Arrow to R README

### Are these changes tested?

Nope

### Are there any user-facing changes?

Nope

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* MINOR: [R] Add Bryce to authors list (#47122)

### Rationale for this change

@ amoeba is a significant contributor to the R package but isn't on the authors list

### What changes are included in this PR?

Add Bryce to aforementioned list

### Are these changes tested?

Nope

### Are there any user-facing changes?

Nope

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* MINOR: [Release] Minor updates to post-release scripts (#47140)

### Rationale for this change

Just doing some maintenance on post-release scripts.

### What changes are included in this PR?

Updates two of our release scripts to make them work correctly.

1. post-13-homebrew.sh: Homebrew changed their default branch to main recently, see https://github.com/Homebrew/homebrew-core/pull/228218.
2. post-15-conan.sh: Makes the sed usage portable so it runs equally well on macOS.

### Are these changes tested?

Yes. I ran them myself during the 21.0.0 post-release tasks.

### Are there any user-facing changes?

No.

Lead-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-46374: [Python][Doc] Improve docs to specify that source argument on parquet.read_table can also be a list of strings (#47142)

See #46374

### What changes are included in this PR?

The docstring for `parquet.read_table` didn't specify that the `source` argument can also be a list of strings. This change updates `_read_table_docstring` to document the list-of-strings case as well.

### Are there any user-facing changes?

Only docs changed.
* GitHub Issue: #46374

Authored-by: Soroush Rasti <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* GH-47153: [Docs][C++] Update cmake target table in build_system.rst with newly added targets (#47154)

### Rationale for this change

We were missing documentation for some of the newer CMake packages and targets we've split out. This adds documentation for those (acero, compute).

### What changes are included in this PR?

- Updates the table in build_system.rst to include ArrowAcero and ArrowCompute

### Are these changes tested?

No, though the rendered docs will be available in CI for others to review.

### Are there any user-facing changes?

No.
* GitHub Issue: #47153

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>

* MINOR: [Release] Update versions for 22.0.0-SNAPSHOT

* MINOR: [Release] Update .deb package names for 22.0.0

* MINOR: [Release] Update .deb/.rpm changelogs for 21.0.0

* MINOR: [Release] Fix issue in post-14-vcpkg.sh causing x-add-version to fail (#47156)

### Rationale for this change

I ran into this running post-14-vcpkg.sh for the 21.0.0 release in https://github.com/microsoft/vcpkg/pull/46477#pullrequestreview-3036678763. If "port-version" exists in vcpkg.json, x-add-version fails with,

```
warning: In arrow, 21.0.0 is a completely new version, so there should be no "port-version". Remove "port-version" and try again. To skip this check, rerun with --skip-version-format-check .
```

This looks like a warning but it's actually a hard error and will cause your upstream PR to bounce.

### What changes are included in this PR?

The script now removes the "port-version" field by default. The reason this worked sometimes and not others is likely that the field is supposed to be absent when its value is 0, and it is usually 0, so our scripts didn't need to update it.

### Are these changes tested?

Yes. Locally.

### Are there any user-facing changes?

No.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* MINOR: [Release] Make post-11-bump-versions.sh work on macOS (#47158)

### Rationale for this change

The script doesn't run out of the box on macOS, since `nproc` is not available.

### What changes are included in this PR?

Makes the determination of the number of jobs dynamic and platform-specific.

### Are these changes tested?

On macOS, yes.

### Are there any user-facing changes?

No.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47125: [CI][Dev] Fix shellcheck errors in the ci/scripts/integration_hdfs.sh (#47126)

### Rationale for this change

This is a sub-issue of #44748.

* SC2034: source_dir appears unused. Verify use (or export if used externally).
* SC2086: Double quote to prevent globbing and word splitting.
* SC2155: Declare and assign separately to avoid masking return values.

```
shellcheck ci/scripts/integration_hdfs.sh

In ci/scripts/integration_hdfs.sh line 22:
source_dir=${1}/cpp
^--------^ SC2034 (warning): source_dir appears unused. Verify use (or export if used externally).

In ci/scripts/integration_hdfs.sh line 25:
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)
       ^-------^ SC2155 (warning): Declare and assign separately to avoid masking return values.
                   ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
export CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath --glob)

In ci/scripts/integration_hdfs.sh line 45:
pushd ${build_dir}
      ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
pushd "${build_dir}"

For more information:
  https://www.shellcheck.net/wiki/SC2034 -- source_dir appears unused. Verify...
  https://www.shellcheck.net/wiki/SC2155 -- Declare and assign separately to ...
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
```

### What changes are included in this PR?

* SC2034: Disable the check with a shellcheck directive.
* SC2086: Quote variables.
* SC2155: Separate the variable declaration from the export.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47125

Authored-by: Hiroyuki Sato <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47131: [C#] Fix day off by 1 in Date64Array (#47132)

### Rationale for this change
`Date64Array.Convert(DateTimeOffset)` subtracts one day from date-times that are at 00:00 am and before 1970.
For exampl…