
Conversation

@sofia-tekdatum
Collaborator

Get the protegrity/arrow project fork up-to-date.

Bring all upstream changes into the main branch, up to the latest Apache Arrow release.

Specifically, this points to the Apache Arrow 22.0.0 RC1 branch.

No changes have been made to main (all development is in another branch), so this merge should work.

Next step is to bring over the work done in our development branch to main.

kou and others added 30 commits July 8, 2025 09:42
….22.0 (#46912)

### Rationale for this change

Bundled Boost 1.81.0 and Apache Thrift 0.22.0 are old.

It's difficult to upgrade only Boost because Apache Thrift depends on Boost. So this PR updates bundled Boost and Apache Thrift. 

### What changes are included in this PR?

* Update bundled Boost:
  * Use a CMake-based build instead of b2
  * Use FetchContent instead of ExternalProject
  * Stop using our trimmed Boost source archive
* Update bundled Apache Thrift:
  * Use FetchContent instead of ExternalProject

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #46745
* GitHub Issue: #46740

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…for Byte Stream Split (#46963)

Thanks for opening a pull request!

### Rationale for this change
Many Linux systems ship Arrow built with SSE4.2, but AVX2 instructions are widely available.
For byte stream split, AVX2 is faster than SSE4.2.

### What changes are included in this PR?
- Make the xsimd functions refactored in #46789 architecture-independent.
- Use dynamic dispatch to AVX2 at runtime if available (builds without SSE4.2 or Neon at compile time were considered too uncommon to include in the dynamic dispatch).

### Are these changes tested?
Yes, the existing tests already cover the code.

### Are there any user-facing changes?
No

* GitHub Issue: #46962

Lead-authored-by: AntoinePrv <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…converted type NONE (#44739)

### Rationale for this change

We are trying to store binary data (in our case, a dump of captured CAN messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) and is not a UTF-8 string (or a text string at all). For this, the physical type BYTE_ARRAY and logical type NONE seem appropriate.

Unfortunately, the Parquet stream writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.

### What changes are included in this PR?

Allow the Parquet stream writer to store data in a BYTE_ARRAY with the NONE logical type. The changes are based on similar changes made earlier to the stream reader.

The reader part has already been fixed in 4d82549, and this uses a similar implementation, but with a stricter set of "exceptions" (only BYTE_ARRAY with the NONE type is allowed).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Only a new feature.

* GitHub Issue: #42971

Authored-by: Adrien Destugues <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…RRAY and FLBA (#47013)

### Rationale for this change

When reading a Parquet leaf column as Arrow, we [presize the Arrow builder](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/arrow/reader.cc#L487-L488) so as to avoid spurious reallocations during incremental Parquet decoding calls.

However, the Reserve method on RecordReader will [only properly reserve values](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/column_reader.cc#L1693-L1696) for non-FLBA non-BYTE_ARRAY physical types.

The result is that, on some of our micro-benchmarks, we spend a significant amount of time reallocating data on the ArrayBuilder. 

### What changes are included in this PR?

Properly reserve space on Array builders when reading Parquet data as Arrow. Note that, when reading into Binary or LargeBinary, this doesn't avoid reallocations for the actual data. However, for FixedSizeBinary and BinaryView, this is sufficient to avoid any reallocations.

Benchmark numbers on my local machine (Ubuntu 24.04):
```
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (250)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                 benchmark         baseline        contender  change %                                                                                                                                                                                                                                     counters
                          BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1    3.295 GiB/sec    7.834 GiB/sec   137.771                               {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 118}
                BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1    3.453 GiB/sec    8.148 GiB/sec   135.957                     {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 119}
                BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.870                      {'family_index': 13, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                          BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.861                                {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0    1.292 GiB/sec    1.662 GiB/sec    28.666                        {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0    1.304 GiB/sec    1.665 GiB/sec    27.691                                  {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:32  959.085 MiB/sec    1.185 GiB/sec    26.568                                     {'family_index': 15, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99    1.012 GiB/sec    1.210 GiB/sec    19.557                       {'family_index': 13, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1    1.011 GiB/sec    1.187 GiB/sec    17.407                       {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99    1.024 GiB/sec    1.201 GiB/sec    17.206                                 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1    1.023 GiB/sec    1.197 GiB/sec    17.016                                     {'family_index': 15, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:32  541.347 MiB/sec  632.640 MiB/sec    16.864                                         {'family_index': 14, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1  954.762 MiB/sec    1.084 GiB/sec    16.272                                  {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1  970.997 MiB/sec    1.100 GiB/sec    15.969                        {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 34}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:-1  592.605 MiB/sec  666.605 MiB/sec    12.487                                        {'family_index': 14, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1  587.604 MiB/sec  659.154 MiB/sec    12.177                          {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1  867.001 MiB/sec  962.427 MiB/sec    11.006                                     {'family_index': 15, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50  473.040 MiB/sec  522.948 MiB/sec    10.551                                 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1    1.633 GiB/sec    1.800 GiB/sec    10.197                                      {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                              BM_ReadStructOfListColumn/50  466.944 MiB/sec  513.407 MiB/sec     9.951                                                                    {'family_index': 20, 'per_family_instance_index': 2, 'run_name': 'BM_ReadStructOfListColumn/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1  894.649 MiB/sec  976.595 MiB/sec     9.160                       {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50  479.717 MiB/sec  523.293 MiB/sec     9.084                       {'family_index': 13, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:-1  613.860 MiB/sec  667.963 MiB/sec     8.814                                         {'family_index': 14, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1    1.479 GiB/sec    1.608 GiB/sec     8.761                        {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1    1.628 GiB/sec    1.762 GiB/sec     8.235                        {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                               BM_ReadStructOfListColumn/0  760.221 MiB/sec  822.339 MiB/sec     8.171                                                                     {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ReadStructOfListColumn/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:32  843.826 MiB/sec  912.397 MiB/sec     8.126                                      {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:32  699.538 MiB/sec  755.468 MiB/sec     7.995                                     {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                            BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024    3.724 GiB/sec    4.007 GiB/sec     7.597                                               {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 176027}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1    1.474 GiB/sec    1.586 GiB/sec     7.591                                      {'family_index': 15, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:-1    1.114 GiB/sec    1.192 GiB/sec     7.005                                          {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:1/unique_values:-1    1.022 GiB/sec    1.091 GiB/sec     6.715                                          {'family_index': 14, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                     BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1    1.101 GiB/sec    1.174 GiB/sec     6.557                            {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000   18.019 MiB/sec   19.100 MiB/sec     5.997    {'family_index': 33, 'per_family_instance_index': 14, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6295}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:32  893.151 MiB/sec  945.900 MiB/sec     5.906                                      {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.243 MiB/sec   21.404 MiB/sec     5.733    {'family_index': 33, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7257}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1  620.583 MiB/sec  655.859 MiB/sec     5.684                           {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:32  751.375 MiB/sec  793.728 MiB/sec     5.637                                          {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:32  537.693 MiB/sec  567.159 MiB/sec     5.480                                         {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
  BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100   44.112 MiB/sec   46.474 MiB/sec     5.355     {'family_index': 33, 'per_family_instance_index': 6, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15273}
   BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.750 MiB/sec   21.843 MiB/sec     5.265      {'family_index': 30, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7387}
                                                      BM_ReadColumn<false,Int32Type>/-1/10    7.621 GiB/sec    8.019 GiB/sec     5.223                                                            {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumn<false,Int32Type>/-1/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 137}

[ ... snip non-significant changes ... ]

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (4)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                           benchmark        baseline       contender  change %                                                                                                                                                                                             counters
                                BM_ReadListColumn/99   1.452 GiB/sec   1.379 GiB/sec    -5.006                                   {'family_index': 21, 'per_family_instance_index': 3, 'run_name': 'BM_ReadListColumn/99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 129}
BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024 270.542 MiB/sec 256.345 MiB/sec    -5.248 {'family_index': 27, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32060}
          BM_ArrowBinaryPlain/DecodeArrow_Dict/65536 172.371 MiB/sec 162.455 MiB/sec    -5.753             {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 319}
    BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024 189.008 MiB/sec 176.900 MiB/sec    -6.406     {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22292}
```

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #47012

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

The system package for xsimd is too old on Fedora 39; use the bundled version instead.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #47037

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…{exact,approximate} (#46385)

### Rationale for this change

`ARROW:average_byte_width:exact` and `ARROW:average_byte_width:approximate` statistics attributes are missing in `arrow::ArrayStatistics`.

### What changes are included in this PR?

Add `average_byte_width` and `is_average_byte_width_exact`  member variables to `arrow::ArrayStatistics`.

### Are these changes tested?
Yes, I ran the relevant unit tests.
### Are there any user-facing changes?
Yes
* GitHub Issue: #45639

Lead-authored-by: Arash Andishgar <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…ng-point comparison when values share the same memory (#47044)

### Rationale for this change

As discussed [here](#46938 (comment)), this is a minor enhancement to `arrow::ChunkedArray::Equals`.

### What changes are included in this PR?

A minor improvement to `arrow::ChunkedArray::Equals` to handle the case where chunked arrays share the same underlying memory.
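
As a small illustration of the semantics from Python (the fast path itself lives in the C++ comparison code, and the construction below is only an assumed way to obtain two chunked arrays backed by the same buffers):

```python
import pyarrow as pa

# Two ChunkedArray wrappers built from the same Array objects, so their
# chunks share the same underlying memory.
chunks = [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0])]
a = pa.chunked_array(chunks)
b = pa.chunked_array(chunks)

# Equality is by value either way; the enhancement lets the C++ comparison
# short-circuit when chunks share memory. The motivating case in the linked
# discussion is floating-point data containing NaN, where an element-wise
# comparison of an array against itself would otherwise report inequality.
assert a.equals(b)
```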

### Are these changes tested?

Yes, I ran the relevant unit tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #46938

Authored-by: Arash Andishgar <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

OpenSUSE 15.5 ships an old GCC (7.5) that doesn't have sufficient C++17 support.

### What changes are included in this PR?

Use Ubuntu 20.04, which ships GCC 9.3, instead of OpenSUSE 15.5.

Ubuntu 20.04 has reached EOL, but we can use it for now.

We discussed why we need the OpenSUSE 15.5 based job at #45718 (comment). We have the job because https://arrow.apache.org/docs/developers/cpp/building.html said "gcc 7.1 and higher should be sufficient".

We will require GCC 9 or later with #46813.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #46989

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

If we use a custom gtest main with MSVC, it always reports "SEH exception".

### What changes are included in this PR?

Remove MSVC version check.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #47033

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…ecordBatch` by calling `arrow.recordBatch` with no input arguments (#47060)

### Rationale for this change

Currently, the `arrow.table` construction function will return an empty `arrow.tabular.Table` if no input arguments are passed  to the function. However, `arrow.recordBatch` throws an error in this case. We should consider making `arrow.recordBatch` behave consistently with `arrow.table` in this case.

This should be relatively straightforward to implement. We can just set the input argument `T` to default to `table.empty(0,0)` in the `arguments` block of the `recordBatch` function, in the same way that `arrow.table` does:

https://github.com/apache/arrow/blob/73454b7040fbea3a187c1bfabd7ea02d46ca3c41/matlab/src/matlab/%2Barrow/table.m#L21

### What changes are included in this PR?

Updated the `arrow.recordBatch` function to return an `arrow.tabular.RecordBatch` instance with zero columns and zero rows if called with zero input arguments. Before this change, the `arrow.recordBatch` function would throw an error if called with zero input arguments.

**Example Usage:**
```matlab
>> rb = arrow.recordBatch()

rb = 

  Arrow RecordBatch with 0 rows and 0 columns
```

### Are these changes tested?

Yes. Added a new test case to `tRecordBatch` called `ConvenienceConstructorZeroArguments`.

### Are there any user-facing changes?

Yes. Users can now call `arrow.recordBatch` with zero input arguments.

* GitHub Issue: #38211

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
### Rationale for this change

We must use the GPG key ID, not the GPG key itself, for `gpg --local-user`.

### What changes are included in this PR?

Use `ARROW_GPG_KEY_UID`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47061

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

`RELEASE_TARBALL` is registered to `GITHUB_ENV` but isn't defined in this context.

### What changes are included in this PR?

Define `RELEASE_TARBALL`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47063

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

v0.55.0 is the latest version. v0.39.0 depends on old grpcio (1.59.0) that doesn't provide wheels for Python 3.13.

### What changes are included in this PR?

Update the default Google Cloud Storage Testbench version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47047

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

We must use `timeout-minutes`, not `timeout`, for timeouts.

### What changes are included in this PR?

Fix key.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47065

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…#47068)

### Rationale for this change

We must use `inputs`, not `input`, for workflow dispatch inputs: https://docs.github.com/en/actions/reference/contexts-reference#inputs-context

### What changes are included in this PR?

Fix the context name.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47067

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

We need `needs: target` for jobs that use the `target` job outputs.

### What changes are included in this PR?

Add missing `needs: target`s.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47069

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

Apache Rat doesn't like hard links.

### What changes are included in this PR?

Use `tar --hard-dereference`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47071

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…ive (#47076)

### Rationale for this change

The current source archive creation is reproducible when we use the same Git working tree.

But it's not reproducible when we use different Git working trees.

### What changes are included in this PR?

Use the committer date of the target commit, instead of the mtime of `csharp/` in the current Git working tree, for `csharp/` in the source archive.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47074

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
… check (#47079)

### Rationale for this change

We need to use `dev/release/utils-create-release-tarball.sh` that exists in the target apache/arrow directory.

### What changes are included in this PR?

Use `dev/release/utils-create-release-tarball.sh` in cloned apache/arrow.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47078

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

Fedora 39 reached EOL on 2024-11-26: https://docs.fedoraproject.org/en-US/releases/eol/

### What changes are included in this PR?

Use Fedora 42, which is the latest release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47045

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

azure-sdk-for-cpp uses `export(PACKAGE)` (https://cmake.org/cmake/help/latest/command/export.html#package). This changes the user package registry (`~/.cmake/packages/`, see https://cmake.org/cmake/help/latest/manual/cmake-packages.7.html#user-package-registry), which is outside the build directory. If the user package registry is changed, other builds may fail.

### What changes are included in this PR?

Disable `export(PACKAGE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47005

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

There are two problems with the verification of the reproducible source archive:

1. CI on macOS isn't prepared correctly
2. Some verification environments may not have the required tools

FYI: We need the following to check reproducible build on macOS:

* Ensure using apache/arrow for `GITHUB_REPOSITORY`
  * `GITHUB_REPOSITORY` is defined automatically on GitHub Actions. Our Crossbow based verification job has `GITHUB_REPOSITORY=ursacomputing/crossbow` by default.
* GNU tar
* GNU gzip

### What changes are included in this PR?

For problem 1:
* Set `GITHUB_REPOSITORY` explicitly
* Install GNU gzip (GNU tar is already installed)

For problem 2:
* Add `TEST_SOURCE_REPRODUCIBLE`, which is `0` by default
* Set `TEST_SOURCE_REPRODUCIBLE=1` on CI
* At least one PMC member must set `TEST_SOURCE_REPRODUCIBLE=1` during release verification

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47081

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…on (#47093)

### Rationale for this change

There are some problems in the APT/Yum previous-version verification:

* There are some typos
* We can't reuse `dev/release/verify-release-candidate.sh` for the previous-version verification

### What changes are included in this PR?

* Fix typos
* Reuse `dev/release/verify-release-candidate.sh` for the previous version verification
* Ignore the previous version verification result for now
  * We may revisit this once we can fix the current problems. See the added comments for details.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47092

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…#47098)

### Rationale for this change
Fixing #41110.

### What changes are included in this PR?
Handle empty stream in `ArrowStreamReaderImplementation`. Similar changes have *not* been made to `ArrowMemoryReaderImplementation` or `ArrowFileReaderImplementation`.

### Are these changes tested?
Two basic unit tests have been created to validate the new behavior. This might not be sufficient to cover all cases where an empty stream should be handled without an exception occurring.

Original change by @ voidstar69; this takes his change and applies the PR feedback to it.

* GitHub Issue: #41110

Lead-authored-by: voidstar69 <[email protected]>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
Performed the following updates:
- Updated BenchmarkDotNet from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated BenchmarkDotNet.Diagnostics.Windows from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.AspNetCore from 2.67.0 to 2.71.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.Tools from 2.71.0 to 2.72.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.0 in /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props


Lead-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
… classes (#47059)

### Rationale for this change

As a follow up to #38531 (see #38531 (comment)), we should consider adding a `validate` method to all `arrow.array.Array` classes, which would allow users to explicitly validate the contents of an `arrow.array.Array` after it is created.

### What changes are included in this PR?

Added `validate()` as a method to `arrow.array.Array`. This method accepts one name-value pair, `ValidationMode`, which can be specified as either `"minimal"` or `"full"`. By default, `ValidationMode="minimal"`.

**Example Usage:**

```matlab
>> offsets = arrow.array(int32([0 1 0]));
>> values = arrow.array(1:3);
>> array = arrow.array.ListArray.fromArrays(offsets, values);
>> array.validate(ValidationMode="full")
>> array.validate(ValidationMode="full")
Error using .  (line 63)
Offset invariant failure: non-monotonic offset at slot 2: 0 < 1

Error in arrow.array.Array/validate (line 68)
             obj.Proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

```

### Are these changes tested?

Yes. Added a MATLAB test class called `tValidateArray.m`.

### Are there any user-facing changes?

Yes. There is a new public method that is accessible via any subclass of `arrow.array.Array`. 

* GitHub Issue: #38532

Lead-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
…when reaching page size limit (#47032)

### Rationale for this change

Ensures Parquet pages are written when the buffered data reaches the configured page size, while also ensuring pages are only split on record boundaries when required.

Without this fix, page sizes can grow unbounded until the row group is closed.
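
For context, the page size in question is the one configured on the writer; a minimal pyarrow sketch of configuring it for a repeated (list) column follows (illustrative only; the fix itself is in the C++ column writer):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A repeated (list) column; with the bug, buffered pages for such columns
# could keep growing past the configured limit until the row group closed.
table = pa.table({"values": pa.array([[1, 2, 3]] * 100_000)})

# data_page_size is the limit the writer should respect when deciding to
# flush a page (pages still only split on record boundaries).
pq.write_table(table, "repeated.parquet", data_page_size=64 * 1024)
```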

### What changes are included in this PR?

Fixes an off-by-one error in the logic that controls when pages can be written.

### Are these changes tested?

Yes, added a new unit test.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**

This bug could cause a crash when writing a large number of rows of a repeated column and reaching a page size > max int32.
* GitHub Issue: #47027

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Adam Reeve <[email protected]>
…on_arrow.sh (#47089)

### Rationale for this change

This is the sub issue #44748.

* SC2046: Quote this to prevent word splitting.
* SC2086: Double quote to prevent globbing and word splitting.
* SC2102: Ranges can only match single chars (mentioned due to duplicates).
* SC2223: This default assignment may cause DoS due to globbing. Quote it.

```
ci/scripts/integration_arrow.sh

In ci/scripts/integration_arrow.sh line 27:
: ${ARROW_INTEGRATION_CPP:=ON}
  ^--------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 28:
: ${ARROW_INTEGRATION_CSHARP:=ON}
  ^-----------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 30:
: ${ARCHERY_INTEGRATION_TARGET_IMPLEMENTATIONS:=cpp,csharp}
  ^-- SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 33:
. ${arrow_dir}/ci/scripts/util_log.sh
  ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
. "${arrow_dir}"/ci/scripts/util_log.sh

In ci/scripts/integration_arrow.sh line 36:
pip install -e $arrow_dir/dev/archery[integration]
               ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-----------^ SC2102 (info): Ranges can only match single chars (mentioned due to duplicates).

Did you mean:
pip install -e "$arrow_dir"/dev/archery[integration]

In ci/scripts/integration_arrow.sh line 66:
    --with-cpp=$([ "$ARROW_INTEGRATION_CPP" == "ON" ] && echo "1" || echo "0") \
               ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 67:
    --with-csharp=$([ "$ARROW_INTEGRATION_CSHARP" == "ON" ] && echo "1" || echo "0") \
                  ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 68:
    --gold-dirs=$gold_dir/0.14.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.14.1 \

In ci/scripts/integration_arrow.sh line 69:
    --gold-dirs=$gold_dir/0.17.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.17.1 \

In ci/scripts/integration_arrow.sh line 70:
    --gold-dirs=$gold_dir/1.0.0-bigendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-bigendian \

In ci/scripts/integration_arrow.sh line 71:
    --gold-dirs=$gold_dir/1.0.0-littleendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-littleendian \

In ci/scripts/integration_arrow.sh line 72:
    --gold-dirs=$gold_dir/2.0.0-compression \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/2.0.0-compression \

In ci/scripts/integration_arrow.sh line 73:
    --gold-dirs=$gold_dir/4.0.0-shareddict \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/4.0.0-shareddict \

For more information:
  https://www.shellcheck.net/wiki/SC2046 -- Quote this to prevent word splitt...
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
  https://www.shellcheck.net/wiki/SC2102 -- Ranges can only match single char...

```

### What changes are included in this PR?

Quote variables.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47088

Authored-by: Hiroyuki Sato <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

See #46629.

### What changes are included in this PR?

This PR updates the `DatasetFactory.inspect` method so that it accepts new `promote_options` and `fragments` parameters. Since we parse the string into a `MergeOptions` struct in three different places, this PR defines the helper function `_parse_field_merge_options`.
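
A rough Python sketch of how the new parameters might be used (the factory construction and argument values below are assumptions; only the `promote_options` and `fragments` parameter names come from this change):

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical Parquet files whose schemas differ slightly between fragments.
factory = ds.FileSystemDatasetFactory(
    fs.LocalFileSystem(),
    ["data/part-0.parquet", "data/part-1.parquet"],  # assumed paths
    ds.ParquetFileFormat(),
)

# Inspect more than just the first fragment when unifying schemas;
# "permissive" mirrors unify_schemas' promote_options and is assumed here,
# as is fragments=None meaning "inspect all fragments".
schema = factory.inspect(promote_options="permissive", fragments=None)
print(schema)
```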

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This adds optional parameters to a public method. It changes the default behavior from checking one fragment to checking all fragments (the old documentation said it inspected "all data fragments" even though it didn't).

* GitHub Issue: #46629

Lead-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change
Cryptographic keys must be kept private. Using the new `arrow::util::SecureString` is vital for storing secrets securely.

### What changes are included in this PR?
Uses the `arrow::util::SecureString` introduced in #46626 for cryptographic keys throughout Parquet encryption.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?
APIs that hand over secrets to Arrow require the secret to be encapsulated in a `SecureString`.

**This PR includes breaking changes to public APIs.**

TODO:
- provide instructions for migration

Supersedes  #12890.

* GitHub Issue: #31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
thisisnic and others added 23 commits October 6, 2025 13:07
…ax (#47622)

### Rationale for this change

Don't need base pipe

### What changes are included in this PR?

Update package to use native pipe

### Are these changes tested?

Sure

### Are there any user-facing changes?

Nah
* GitHub Issue: #47106

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
… error (#47660)

### Rationale for this change

Fixes issue at #47659

### What changes are included in this PR?

Add gmock as a private link library to the `arrow_flight_testing` shared library.

### Are these changes tested?

Build for `arrow_flight_testing` succeeds on my Windows environment 

### Are there any user-facing changes?
No
* GitHub Issue: #47659

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…47454)

### Rationale for this change
Currently, the `maps_as_pydicts` parameter to `MapScalar.as_py` does not work on nested maps. See below:

```
import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)
print(s.as_py(maps_as_pydicts="strict"))

# {'x': {'a': [('1', 1)]}}
```

In this ^ case, I'd want to get the value: `{'x': {'a': {'1': 1}}}`, such that round trips would work as expected.
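
With the fix, the inner map is converted as well, so the round trip described above should hold; a small sketch of that expectation (illustrative, not a test from this PR):

```python
import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)

# After this change, inner maps come back as dicts too, so converting to
# Python and back should reproduce the original scalar.
roundtripped = s.as_py(maps_as_pydicts="strict")
assert roundtripped == v
assert pa.scalar(roundtripped, type=t).equals(s)
```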

### What changes are included in this PR?
Apply `maps_as_pydicts` to nested values in map types as well, and update the relevant test.
 
### Are these changes tested?
Yes

### Are there any user-facing changes?
Yes, just a user-facing fix.

* GitHub Issue: #47380

Lead-authored-by: Johanna <[email protected]>
Co-authored-by: zzkv <[email protected]>
Co-authored-by: Johanna <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…repository (#47600)

### Rationale for this change

There are several things that make this change desirable. We want to move some CI jobs from `ursacomputing/crossbow` to `apache/arrow`. Moving the Linux packaging jobs will allow us to automate some release tasks and potentially (if we are able to make reproducible builds for Linux packaging work) add automated signing to them, avoiding the need to require a PMC signature for the Linux packaging artifacts.

### What changes are included in this PR?

- Move `check_labels` and `report_ci` jobs to independent reusable workflows.
- Update `cpp_extra` to use those.
- Create new `linux_packaging.yml` workflow replicating work that was done on crossbow. Integrate that workflow with `check_labels` and `report_ci`
- Update release binary submit and binary download to run workflow when tag is pushed and download the artifacts from the release instead of from the crossbow repository.

### Are these changes tested?

Some via CI on fork and some manual testing.

### Are there any user-facing changes?

No

* GitHub Issue: #47582

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…e.g. "+04:30") (#12865)

ARROW-14477: #30036

Currently, timestamp arrays have the type `timestamp(unit, zone name)`. This would add "offset timezones", so that timestamp arrays also support types like `timestamp(unit, "+/-HH:MM")`.
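
From Python, for example, a fixed UTC offset can then be used where a zone name is accepted (a minimal pyarrow sketch; the values are illustrative, and the kernel support itself lives in the C++ layer):

```python
import pyarrow as pa

# A timestamp type carrying a fixed UTC offset instead of an IANA zone name.
ty = pa.timestamp("s", tz="+04:30")
arr = pa.array([0, 3600], type=ty)

print(ty)   # timestamp[s, tz=+04:30]
print(arr)  # values are stored as UTC; the offset applies on display/conversion
```
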
* GitHub Issue: #30036

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Rok <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
…rted image (#47730)

### Rationale for this change

Old image fails due to a Debian update

### What changes are included in this PR?

Use newer image

### Are these changes tested?

Will submit crossbow run

### Are there any user-facing changes?

No
* GitHub Issue: #47705

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

#45964 changed paths of pre-built Apache Arrow C++ binaries for R. But we forgot to update the nightly upload job.

### What changes are included in this PR?

Update paths in the nightly upload job.

### Are these changes tested?

No...

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47704

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…47743)

### Rationale for this change

Valgrind would report memory leaks induced by protobuf initialization on library load, for example:
```
==14628== 414 bytes in 16 blocks are possibly lost in loss record 22 of 26
==14628==    at 0x4914EFF: operator new(unsigned long) (vg_replace_malloc.c:487)
==14628==    by 0x8D0B6CA: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.0] (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D33E62: google::protobuf::DescriptorPool::Tables::Tables() (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D340E2: google::protobuf::DescriptorPool::DescriptorPool(google::protobuf::DescriptorDatabase*, google::protobuf::DescriptorPool::ErrorCollector*) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D341A2: google::protobuf::DescriptorPool::internal_generated_pool() (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D34277: google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D9C56F: google::protobuf::internal::AddDescriptorsRunner::AddDescriptorsRunner(google::protobuf::internal::DescriptorTable const*) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x40D147D: call_init.part.0 (dl-init.c:70)
==14628==    by 0x40D1567: call_init (dl-init.c:33)
==14628==    by 0x40D1567: _dl_init (dl-init.c:117)
==14628==    by 0x40EB2C9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
```

This was triggered by the `libprotobuf` upgrade on conda-forge from 3.21.12 to 4.25.3.

### What changes are included in this PR?

Add a Valgrind suppression for these leak reports, as there is probably not much we can do about them.

### Are these changes tested?

Yes, by existing CI test.

### Are there any user-facing changes?

No.

* GitHub Issue: #47742

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…Parquet data (#47741)

### Rationale for this change

Fix issues found by OSS-Fuzz when invalid Parquet data is fed to the Parquet reader:
* https://issues.oss-fuzz.com/issues/447262173
* https://issues.oss-fuzz.com/issues/447480433
* https://issues.oss-fuzz.com/issues/447490896
* https://issues.oss-fuzz.com/issues/447693724
* https://issues.oss-fuzz.com/issues/447693728
* https://issues.oss-fuzz.com/issues/449498800

### Are these changes tested?

Yes, using the updated fuzz regression files from apache/arrow-testing#115

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

* GitHub Issue: #47740

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change
By default, mimalloc generates LSE atomic instructions that only work on armv8.1. This causes illegal-instruction errors on armv8.0 platforms such as the Raspberry Pi 4. This PR sets the mimalloc build flag -DMI_NO_OPT_ARCH=ON to disable LSE instructions.
Please note that even with the flag set, the compiler and libc will replace the atomic calls with an ifunc that best matches the hardware at runtime. That means LSE is used only if the running platform supports it.

### What changes are included in this PR?
Force mimalloc build flag -DMI_NO_OPT_ARCH=ON.

### Are these changes tested?
Manually tested.

### Are there any user-facing changes?
No.

**This PR contains a "Critical Fix".**
Fixes crashes on Armv8.0 platform.
* GitHub Issue: #47229

Lead-authored-by: Yibo Cai <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

According to microsoft/mimalloc#1073, mimalloc v3 is preferred over v2 for production usage.

There are reports of higher than expected memory consumption with mimalloc 2.2.x, notably when reading Parquet data (example: GH-47266).

### What changes are included in this PR?

Bump to mimalloc 3.1.5, which is the latest mimalloc 3.1.x release as of this writing.

### Are these changes tested?

Yes, by existing tests and CI.

### Are there any user-facing changes?

Hopefully not, besides a potential reduction in memory usage due to improvements in mimalloc v3.

* GitHub Issue: #47588

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

There are link errors with build options for JNI on macOS.

### What changes are included in this PR?

`ARROW_BUNDLED_STATIC_LIBS` has CMake target names defined in Apache Arrow, not `find_package()`-ed target names. So we should use `aws-c-common`, not `AWS::aws-c-common`.

Recent aws-c-common (or a related dependency) uses the Network framework, so add `Network` to the dependencies of `Arrow::arrow_bundled_dependencies`.

Don't use `compute/kernels/temporal_internal.cc` in both `libarrow.dylib` and `libarrow_compute.dylib`, to avoid a duplicate-symbols error.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47748

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

This prevents breaking the Apache Arrow Java JNI use case on Linux.

### What changes are included in this PR?

* Add a CI job that uses build options for JNI use case
* Install more packages in manylinux image that is also used by JNI build 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47632

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

`archery docker push` doesn't support custom Docker registry such as ghcr.io.

### What changes are included in this PR?

Parse the Docker image tag and pass the Docker registry name to `docker push` if one is specified in the tag.

Docker image tag format: `[HOST[:PORT]/]NAMESPACE/REPOSITORY[:TAG]`

See also: https://docs.docker.com/reference/cli/docker/image/tag/#description
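
A rough sketch of the tag-parsing idea (not the actual archery implementation; the helper name and heuristic are hypothetical):

```python
def split_registry(tag: str) -> tuple[str | None, str]:
    """Split an optional [HOST[:PORT]/] prefix off a Docker image tag.

    The first path component is treated as a registry host only if it looks
    like one (contains '.' or ':', or is 'localhost'), mirroring how Docker
    distinguishes registries from namespaces.
    """
    first, sep, rest = tag.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return None, tag


# split_registry("ghcr.io/apache/arrow-dev:amd64") -> ("ghcr.io", "apache/arrow-dev:amd64")
# split_registry("apache/arrow-dev:amd64")         -> (None, "apache/arrow-dev:amd64")
```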

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47795

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…47616)

### Rationale for this change

Python 3.14 is currently in a prerelease status and is expected to have a final release in October this year (https://peps.python.org/pep-0745/).

We should ensure we are fully ready to support Python 3.14 for the PyArrow 22 release.

### What changes are included in this PR?

This PR  updates wheels for Python 3.14.

### Are these changes tested?

Tested in the CI and with extended builds.

### Are there any user-facing changes?

No, but users will be able to use PyArrow with Python 3.14.

* GitHub Issue: #47438

---

Todo:

- Update the image revision name in `.env`
- Add 3.14 conda build ([arrow/dev/tasks/tasks.yml](https://github.com/apache/arrow/blob/d803afcc43f5d132506318fd9e162d33b2c3d4cd/dev/tasks/tasks.yml#L809)) when conda-forge/pyarrow-feedstock#156 is merged 

Follow-ups:

- #47437

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
…47804)

### Rationale for this change

Found by OSS-Fuzz, should fix https://issues.oss-fuzz.com/issues/451150486.

### What changes are included in this PR?

Ensure the RLE run is within bounds before reading it.

### Are these changes tested?

Yes, by a fuzz regression test in the ASAN/UBSAN build.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

* GitHub Issue: #47803

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

Summarise changes for release

### What changes are included in this PR?

Update NEWS file

### Are these changes tested?

No

### Are there any user-facing changes?

No
* GitHub Issue: #47738

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…l patch from conda (#47810)

### Rationale for this change

Our verify-rc-source Windows job is failing due to patch not being available for Windows.

### What changes are included in this PR?

Move patch requirement from `conda_env_cpp.txt` to `conda_env_unix.txt`

### Are these changes tested?

Yes via CI and archery.

### Are there any user-facing changes?

No

* GitHub Issue: #47809

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
… release branch push (#47826)

### Rationale for this change

We require the Linux package jobs to be triggered on RC tag creation. For example, for 22.0.0 we currently push the tag `apache-arrow-22.0.0-rc0` and the release branch `release-22.0.0-rc0`. Those events trigger builds over the same commit, and the tag event gets cancelled due to a "high priority task" triggering the same jobs. This causes jobs to fail on the branch because ARROW_VERSION is not generated. If we manually re-trigger the jobs on the tag, they are successful.

### What changes are included in this PR?

Remove the `release-*` branches from triggering the event to allow only the tag to run the jobs so they don't get cancelled.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* GitHub Issue: #47819

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…ign with the variant spec (#47835)

### Rationale for this change
According to the [Variant specification](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md), the specification_version field must be set to 1 to indicate Variant encoding version 1. Currently, this field defaults to 0, which violates the specification. Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

### What changes are included in this PR?
The change includes defaulting the specification version to 1.
### Are these changes tested?
The change is covered by unit test.
### Are there any user-facing changes?
The Parquet files produced now carry the Variant logical type annotation `VARIANT(1)`.

```
Schema:
message schema {
  optional group V (VARIANT(1)) = 1 {
    required binary metadata;
    required binary value;
  }
}
```

* GitHub Issue: #47838

Lead-authored-by: Aihua <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Collaborator

@argmarco-tkd left a comment


LGTM. - thanks for this!

@sofia-tekdatum sofia-tekdatum merged commit 7c19398 into protegrity:main Nov 19, 2025
31 of 63 checks passed


Development

Successfully merging this pull request may close these issues.

Git merging > Sync protegrity/arrow main branch to latest Arrow release/stable version