
Conversation

@sofia-tekdatum
Collaborator

Get the protegrity/arrow project fork up-to-date.

Bring all upstream changes into the main branch, up to the latest Apache Arrow release.

Specifically, this points to the Apache Arrow 22.0.0 RC1 branch.

No changes have been made to main (all development is in another branch), so this merge should work.

Next step is to bring over the work done in our development branch to main.

kou and others added 30 commits July 8, 2025 09:42
….22.0 (#46912)

### Rationale for this change

Bundled Boost 1.81.0 and Apache Thrift 0.22.0 are old.

It's difficult to upgrade only Boost because Apache Thrift depends on Boost. So this PR updates bundled Boost and Apache Thrift. 

### What changes are included in this PR?

* Update bundled Boost:
  * Use a CMake-based build instead of b2
  * Use FetchContent instead of ExternalProject
  * Stop using our trimmed Boost source archive
* Update bundled Apache Thrift:
  * Use FetchContent instead of ExternalProject

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #46745
* GitHub Issue: #46740

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…for Byte Stream Split (#46963)

Thanks for opening a pull request!

### Rationale for this change
Many Linux systems ship Arrow built with SSE4.2, but AVX2 instructions are widely available.
For byte stream split, AVX2 is faster than SSE4.2.

### What changes are included in this PR?
- Make the xsimd functions refactored in #46789 architecture-independent.
- Use dynamic dispatch to AVX2 at runtime if available (builds without SSE4.2 or Neon at compile time were considered too uncommon to include in the dynamic dispatch).

### Are these changes tested?
Yes, the existing tests already cover the code.

### Are there any user-facing changes?
No

* GitHub Issue: #46962

Lead-authored-by: AntoinePrv <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…converted type NONE (#44739)

### Rationale for this change

We are trying to store binary data (in our case, a dump of captured CAN messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) and is not a UTF-8 string (or a text string at all). For this, the physical type BYTE_ARRAY and logical type NONE seem appropriate.

Unfortunately, the Parquet stream writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.

### What changes are included in this PR?

Allow the Parquet stream writer to store data in a BYTE_ARRAY with the NONE logical type. The changes are based on similar changes made earlier to the stream reader.

The reader part has already been fixed in 4d82549, and this uses a similar implementation, but with a stricter set of "exceptions" (only BYTE_ARRAY with the NONE type is allowed).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Only a new feature.

* GitHub Issue: #42971

Authored-by: Adrien Destugues <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…RRAY and FLBA (#47013)

### Rationale for this change

When reading a Parquet leaf column as Arrow, we [presize the Arrow builder](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/arrow/reader.cc#L487-L488) so as to avoid spurious reallocations during incremental Parquet decoding calls.

However, the Reserve method on RecordReader will [only properly reserve values](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/column_reader.cc#L1693-L1696) for non-FLBA non-BYTE_ARRAY physical types.

The result is that, on some of our micro-benchmarks, we spend a significant amount of time reallocating data on the ArrayBuilder. 

### What changes are included in this PR?

Properly reserve space on Array builders when reading Parquet data as Arrow. Note that, when reading into Binary or LargeBinary, this doesn't avoid reallocations for the actual data. However, for FixedSizeBinary and BinaryView, this is sufficient to avoid any reallocations.

Benchmark numbers on my local machine (Ubuntu 24.04):
```
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (250)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                 benchmark         baseline        contender  change %                                                                                                                                                                                                                                     counters
                          BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1    3.295 GiB/sec    7.834 GiB/sec   137.771                               {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 118}
                BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1    3.453 GiB/sec    8.148 GiB/sec   135.957                     {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 119}
                BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.870                      {'family_index': 13, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                          BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100    1.360 GiB/sec    1.780 GiB/sec    30.861                                {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0    1.292 GiB/sec    1.662 GiB/sec    28.666                        {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0    1.304 GiB/sec    1.665 GiB/sec    27.691                                  {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:32  959.085 MiB/sec    1.185 GiB/sec    26.568                                     {'family_index': 15, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99    1.012 GiB/sec    1.210 GiB/sec    19.557                       {'family_index': 13, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1    1.011 GiB/sec    1.187 GiB/sec    17.407                       {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99    1.024 GiB/sec    1.201 GiB/sec    17.206                                 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36}
                              BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1    1.023 GiB/sec    1.197 GiB/sec    17.016                                     {'family_index': 15, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:32  541.347 MiB/sec  632.640 MiB/sec    16.864                                         {'family_index': 14, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9}
                            BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1  954.762 MiB/sec    1.084 GiB/sec    16.272                                  {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33}
                  BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1  970.997 MiB/sec    1.100 GiB/sec    15.969                        {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 34}
                                  BM_ReadBinaryColumn/null_probability:99/unique_values:-1  592.605 MiB/sec  666.605 MiB/sec    12.487                                        {'family_index': 14, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1  587.604 MiB/sec  659.154 MiB/sec    12.177                          {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1  867.001 MiB/sec  962.427 MiB/sec    11.006                                     {'family_index': 15, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                           BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50  473.040 MiB/sec  522.948 MiB/sec    10.551                                 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1    1.633 GiB/sec    1.800 GiB/sec    10.197                                      {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                              BM_ReadStructOfListColumn/50  466.944 MiB/sec  513.407 MiB/sec     9.951                                                                    {'family_index': 20, 'per_family_instance_index': 2, 'run_name': 'BM_ReadStructOfListColumn/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27}
                BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1  894.649 MiB/sec  976.595 MiB/sec     9.160                       {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50  479.717 MiB/sec  523.293 MiB/sec     9.084                       {'family_index': 13, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:-1  613.860 MiB/sec  667.963 MiB/sec     8.814                                         {'family_index': 14, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1    1.479 GiB/sec    1.608 GiB/sec     8.761                        {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                 BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1    1.628 GiB/sec    1.762 GiB/sec     8.235                        {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5}
                                                               BM_ReadStructOfListColumn/0  760.221 MiB/sec  822.339 MiB/sec     8.171                                                                     {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ReadStructOfListColumn/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:32  843.826 MiB/sec  912.397 MiB/sec     8.126                                      {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                              BM_ReadBinaryViewColumn/null_probability:50/unique_values:32  699.538 MiB/sec  755.468 MiB/sec     7.995                                     {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                            BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024    3.724 GiB/sec    4.007 GiB/sec     7.597                                               {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 176027}
                               BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1    1.474 GiB/sec    1.586 GiB/sec     7.591                                      {'family_index': 15, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:-1    1.114 GiB/sec    1.192 GiB/sec     7.005                                          {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:1/unique_values:-1    1.022 GiB/sec    1.091 GiB/sec     6.715                                          {'family_index': 14, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
                     BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1    1.101 GiB/sec    1.174 GiB/sec     6.557                            {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000   18.019 MiB/sec   19.100 MiB/sec     5.997    {'family_index': 33, 'per_family_instance_index': 14, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6295}
                               BM_ReadBinaryViewColumn/null_probability:0/unique_values:32  893.151 MiB/sec  945.900 MiB/sec     5.906                                      {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
 BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.243 MiB/sec   21.404 MiB/sec     5.733    {'family_index': 33, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7257}
                    BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1  620.583 MiB/sec  655.859 MiB/sec     5.684                           {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                   BM_ReadBinaryColumn/null_probability:0/unique_values:32  751.375 MiB/sec  793.728 MiB/sec     5.637                                          {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
                                  BM_ReadBinaryColumn/null_probability:50/unique_values:32  537.693 MiB/sec  567.159 MiB/sec     5.480                                         {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
  BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100   44.112 MiB/sec   46.474 MiB/sec     5.355     {'family_index': 33, 'per_family_instance_index': 6, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15273}
   BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000   20.750 MiB/sec   21.843 MiB/sec     5.265      {'family_index': 30, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7387}
                                                      BM_ReadColumn<false,Int32Type>/-1/10    7.621 GiB/sec    8.019 GiB/sec     5.223                                                            {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumn<false,Int32Type>/-1/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 137}

[ ... snip non-significant changes ... ]

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (4)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                           benchmark        baseline       contender  change %                                                                                                                                                                                             counters
                                BM_ReadListColumn/99   1.452 GiB/sec   1.379 GiB/sec    -5.006                                   {'family_index': 21, 'per_family_instance_index': 3, 'run_name': 'BM_ReadListColumn/99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 129}
BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024 270.542 MiB/sec 256.345 MiB/sec    -5.248 {'family_index': 27, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32060}
          BM_ArrowBinaryPlain/DecodeArrow_Dict/65536 172.371 MiB/sec 162.455 MiB/sec    -5.753             {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 319}
    BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024 189.008 MiB/sec 176.900 MiB/sec    -6.406     {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22292}
```

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #47012

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

The system package for xsimd is too old on Fedora 39; use the bundled version instead.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #47037

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…{exact,approximate} (#46385)

### Rationale for this change

`ARROW:average_byte_width:exact` and `ARROW:average_byte_width:approximate` statistics attributes are missing in `arrow::ArrayStatistics`.

### What changes are included in this PR?

Add `average_byte_width` and `is_average_byte_width_exact`  member variables to `arrow::ArrayStatistics`.

### Are these changes tested?
Yes, I ran the relevant unit tests.
### Are there any user-facing changes?
Yes
* GitHub Issue: #45639

Lead-authored-by: Arash Andishgar <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…ng-point comparison when values share the same memory (#47044)

### Rationale for this change

As discussed [here](#46938 (comment)), this is a minor enhancement to `arrow::ChunkedArray::Equals`.

### What changes are included in this PR?

A minor improvement to `arrow::ChunkedArray::Equals` to handle the case where chunked arrays share the same underlying memory.
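
As a small illustration of the semantics from Python (the fast path itself lives in the C++ comparison code, and the construction below is only an assumed way to obtain two chunked arrays backed by the same buffers):

```python
import pyarrow as pa

# Two ChunkedArray wrappers built from the same Array objects, so their
# chunks share the same underlying memory.
chunks = [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0])]
a = pa.chunked_array(chunks)
b = pa.chunked_array(chunks)

# Equality is by value either way; the enhancement lets the C++ comparison
# short-circuit when chunks share memory. The motivating case in the linked
# discussion is floating-point data containing NaN, where an element-wise
# comparison of an array against itself would otherwise report inequality.
assert a.equals(b)
```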

### Are these changes tested?

Yes, I ran the relevant unit tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #46938

Authored-by: Arash Andishgar <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

OpenSUSE 15.5 ships an old GCC (7.5) that doesn't have sufficient C++17 support.

### What changes are included in this PR?

Use Ubuntu 20.04, which ships GCC 9.3, instead of OpenSUSE 15.5.

Ubuntu 20.04 has reached EOL, but we can use it for now.

We discussed why we need the OpenSUSE 15.5 based job at #45718 (comment). We have the job because https://arrow.apache.org/docs/developers/cpp/building.html said "gcc 7.1 and higher should be sufficient".

We will require GCC 9 or later with #46813.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #46989

Lead-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

If we use a custom gtest main with MSVC, it always reports "SEH exception".

### What changes are included in this PR?

Remove MSVC version check.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #47033

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…ecordBatch` by calling `arrow.recordBatch` with no input arguments (#47060)

### Rationale for this change

Currently, the `arrow.table` construction function will return an empty `arrow.tabular.Table` if no input arguments are passed  to the function. However, `arrow.recordBatch` throws an error in this case. We should consider making `arrow.recordBatch` behave consistently with `arrow.table` in this case.

This should be relatively straightforward to implement. We can just set the input argument `T` to default to `table.empty(0,0)` in the `arguments` block of the `recordBatch` function, in the same way that `arrow.table` does:

https://github.com/apache/arrow/blob/73454b7040fbea3a187c1bfabd7ea02d46ca3c41/matlab/src/matlab/%2Barrow/table.m#L21

### What changes are included in this PR?

Updated the `arrow.recordBatch` function to return an `arrow.tabular.RecordBatch` instance with zero columns and zero rows if called with zero input arguments. Before this change, the `arrow.recordBatch` function would throw an error if called with zero input arguments.

**Example Usage:**
```matlab
>> rb = arrow.recordBatch()

rb = 

  Arrow RecordBatch with 0 rows and 0 columns
```

### Are these changes tested?

Yes. Added a new test case to `tRecordBatch` called `ConvenienceConstructorZeroArguments`.

### Are there any user-facing changes?

Yes. Users can now call `arrow.recordBatch` with zero input arguments.

* GitHub Issue: #38211

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
### Rationale for this change

We must use the GPG key ID, not the GPG key itself, for `gpg --local-user`.

### What changes are included in this PR?

Use `ARROW_GPG_KEY_UID`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47061

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

`RELEASE_TARBALL` is registered to `GITHUB_ENV` but isn't defined in this context.

### What changes are included in this PR?

Define `RELEASE_TARBALL`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47063

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

v0.55.0 is the latest version. v0.39.0 depends on old grpcio (1.59.0) that doesn't provide wheels for Python 3.13.

### What changes are included in this PR?

Update the default Google Cloud Storage Testbench version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47047

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

We must use `timeout-minutes`, not `timeout`, for timeouts.

### What changes are included in this PR?

Fix key.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47065

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…#47068)

### Rationale for this change

We must use `inputs`, not `input`, for workflow dispatch inputs: https://docs.github.com/en/actions/reference/contexts-reference#inputs-context

### What changes are included in this PR?

Fix the context name.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47067

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

We need `needs: target` for jobs that use the `target` job outputs.

### What changes are included in this PR?

Add missing `needs: target`s.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.
* GitHub Issue: #47069

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

Apache Rat doesn't like hard links.

### What changes are included in this PR?

Use `tar --hard-dereference`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47071

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…ive (#47076)

### Rationale for this change

The current source archive creation is reproducible when we use the same Git working tree.

But it's not reproducible when we use different Git working trees.

### What changes are included in this PR?

Use the committer date of the target commit, instead of the mtime of `csharp/` in the current Git working tree, for `csharp/` in the source archive.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47074

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
… check (#47079)

### Rationale for this change

We need to use `dev/release/utils-create-release-tarball.sh` that exists in the target apache/arrow directory.

### What changes are included in this PR?

Use `dev/release/utils-create-release-tarball.sh` in cloned apache/arrow.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47078

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

Fedora 39 reached EOL on 2024-11-26: https://docs.fedoraproject.org/en-US/releases/eol/

### What changes are included in this PR?

Use Fedora 42, which is the latest release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47045

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

azure-sdk-for-cpp uses `export(PACKAGE)` (https://cmake.org/cmake/help/latest/command/export.html#package). This changes the user package registry (`~/.cmake/packages/`, see https://cmake.org/cmake/help/latest/manual/cmake-packages.7.html#user-package-registry), which is outside the build directory. If the user package registry is changed, other builds may fail.

### What changes are included in this PR?

Disable `export(PACKAGE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47005

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

There are two problems with the verification of the reproducible source archive:

1. CI on macOS isn't prepared correctly
2. Some verification environments may not have the required tools

FYI: We need the following to check reproducible build on macOS:

* Ensure using apache/arrow for `GITHUB_REPOSITORY`
  * `GITHUB_REPOSITORY` is defined automatically on GitHub Actions. Our Crossbow based verification job has `GITHUB_REPOSITORY=ursacomputing/crossbow` by default.
* GNU tar
* GNU gzip

### What changes are included in this PR?

For problem 1:
* Set `GITHUB_REPOSITORY` explicitly
* Install GNU gzip (GNU tar is already installed)

For problem 2:
* Add `TEST_SOURCE_REPRODUCIBLE`, which is `0` by default
* Set `TEST_SOURCE_REPRODUCIBLE=1` on CI
* At least one PMC member must set `TEST_SOURCE_REPRODUCIBLE=1` during release verification

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47081

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…on (#47093)

### Rationale for this change

There are some problems in the APT/Yum previous-version verification:

* There are some typos
* We can't reuse `dev/release/verify-release-candidate.sh` for the previous-version verification

### What changes are included in this PR?

* Fix typos
* Reuse `dev/release/verify-release-candidate.sh` for the previous version verification
* Ignore the previous version verification result for now
  * We may revisit this once we can fix the current problems. See the added comments for details.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47092

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…#47098)

### Rationale for this change
Fixing #41110.

### What changes are included in this PR?
Handle empty stream in `ArrowStreamReaderImplementation`. Similar changes have *not* been made to `ArrowMemoryReaderImplementation` or `ArrowFileReaderImplementation`.

### Are these changes tested?
Two basic unit tests have been created to validate the new behavior. This might not be sufficient to cover all cases where an empty stream should be handled without an exception occurring.

Original change by @ voidstar69; this takes his change and applies the PR feedback to it.

* GitHub Issue: #41110

Lead-authored-by: voidstar69 <[email protected]>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
Performed the following updates:
- Updated BenchmarkDotNet from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated BenchmarkDotNet.Diagnostics.Windows from 0.14.0 to 0.15.2 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj, /csharp/test/Directory.Build.props
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Google.Protobuf from 3.30.2 to 3.31.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.AspNetCore from 2.67.0 to 2.71.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj, /csharp/test/Directory.Build.props
- Updated Grpc.Tools from 2.71.0 to 2.72.0 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj, /csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.0 in /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated Microsoft.NET.Test.Sdk from 17.13.0 to 17.14.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj, /csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj, /csharp/test/Directory.Build.props
- Updated xunit.runner.visualstudio from 3.1.0 to 3.1.1 in /csharp/Directory.Build.props, /csharp/Directory.Build.targets, /csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj, /csharp/test/Directory.Build.props


Lead-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: Curt Hagenlocher <[email protected]>
… classes (#47059)

### Rationale for this change

As a follow up to #38531 (see #38531 (comment)), we should consider adding a `validate` method to all `arrow.array.Array` classes, which would allow users to explicitly validate the contents of an `arrow.array.Array` after it is created.

### What changes are included in this PR?

Added `validate()` as a method to `arrow.array.Array`. This method accepts one name-value pair, `ValidationMode`, which can be specified as either `"minimal"` or `"full"`. By default, `ValidationMode="minimal"`.

**Example Usage:**

```matlab
>> offsets = arrow.array(int32([0 1 0]));
>> values = arrow.array(1:3);
>> array = arrow.array.ListArray.fromArrays(offsets, values);
>> array.validate(ValidationMode="full")
>> array.validate(ValidationMode="full")
Error using .  (line 63)
Offset invariant failure: non-monotonic offset at slot 2: 0 < 1

Error in arrow.array.Array/validate (line 68)
             obj.Proxy.validate(struct(ValidationMode=uint8(opts.ValidationMode)));
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

```

### Are these changes tested?

Yes. Added a MATLAB test class called `tValidateArray.m`.

### Are there any user-facing changes?

Yes. There is a new public method that is accessible via any subclass of `arrow.array.Array`. 

* GitHub Issue: #38532

Lead-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Sarah Gilmore <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
…when reaching page size limit (#47032)

### Rationale for this change

Ensures Parquet pages are written when the buffered data reaches the configured page size, while also ensuring pages are only split on record boundaries when required.

Without this fix, page sizes can grow unbounded until the row group is closed.
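
For context, the page size in question is the one configured on the writer; a minimal pyarrow sketch of configuring it for a repeated (list) column follows (illustrative only; the fix itself is in the C++ column writer):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A repeated (list) column; with the bug, buffered pages for such columns
# could keep growing past the configured limit until the row group closed.
table = pa.table({"values": pa.array([[1, 2, 3]] * 100_000)})

# data_page_size is the limit the writer should respect when deciding to
# flush a page (pages still only split on record boundaries).
pq.write_table(table, "repeated.parquet", data_page_size=64 * 1024)
```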

### What changes are included in this PR?

Fixes an off-by-one error in the logic that controls when pages can be written.

### Are these changes tested?

Yes, added a new unit test.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**

This bug could cause a crash when writing a large number of rows of a repeated column and reaching a page size > max int32.
* GitHub Issue: #47027

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: Adam Reeve <[email protected]>
…on_arrow.sh (#47089)

### Rationale for this change

This is the sub issue #44748.

* SC2046: Quote this to prevent word splitting.
* SC2086: Double quote to prevent globbing and word splitting.
* SC2102: Ranges can only match single chars (mentioned due to duplicates).
* SC2223: This default assignment may cause DoS due to globbing. Quote it.

```
ci/scripts/integration_arrow.sh

In ci/scripts/integration_arrow.sh line 27:
: ${ARROW_INTEGRATION_CPP:=ON}
  ^--------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 28:
: ${ARROW_INTEGRATION_CSHARP:=ON}
  ^-----------------------------^ SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 30:
: ${ARCHERY_INTEGRATION_TARGET_IMPLEMENTATIONS:=cpp,csharp}
  ^-- SC2223 (info): This default assignment may cause DoS due to globbing. Quote it.

In ci/scripts/integration_arrow.sh line 33:
. ${arrow_dir}/ci/scripts/util_log.sh
  ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
. "${arrow_dir}"/ci/scripts/util_log.sh

In ci/scripts/integration_arrow.sh line 36:
pip install -e $arrow_dir/dev/archery[integration]
               ^--------^ SC2086 (info): Double quote to prevent globbing and word splitting.
                                     ^-----------^ SC2102 (info): Ranges can only match single chars (mentioned due to duplicates).

Did you mean:
pip install -e "$arrow_dir"/dev/archery[integration]

In ci/scripts/integration_arrow.sh line 66:
    --with-cpp=$([ "$ARROW_INTEGRATION_CPP" == "ON" ] && echo "1" || echo "0") \
               ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 67:
    --with-csharp=$([ "$ARROW_INTEGRATION_CSHARP" == "ON" ] && echo "1" || echo "0") \
                  ^-- SC2046 (warning): Quote this to prevent word splitting.

In ci/scripts/integration_arrow.sh line 68:
    --gold-dirs=$gold_dir/0.14.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.14.1 \

In ci/scripts/integration_arrow.sh line 69:
    --gold-dirs=$gold_dir/0.17.1 \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/0.17.1 \

In ci/scripts/integration_arrow.sh line 70:
    --gold-dirs=$gold_dir/1.0.0-bigendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-bigendian \

In ci/scripts/integration_arrow.sh line 71:
    --gold-dirs=$gold_dir/1.0.0-littleendian \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/1.0.0-littleendian \

In ci/scripts/integration_arrow.sh line 72:
    --gold-dirs=$gold_dir/2.0.0-compression \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/2.0.0-compression \

In ci/scripts/integration_arrow.sh line 73:
    --gold-dirs=$gold_dir/4.0.0-shareddict \
                ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean:
    --gold-dirs="$gold_dir"/4.0.0-shareddict \

For more information:
  https://www.shellcheck.net/wiki/SC2046 -- Quote this to prevent word splitt...
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
  https://www.shellcheck.net/wiki/SC2102 -- Ranges can only match single char...

```

### What changes are included in this PR?

Quote variables.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47088

Authored-by: Hiroyuki Sato <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

See #46629.

### What changes are included in this PR?

This PR updates the `DatasetFactory.inspect` method so that it accepts new `promote_options` and `fragments` parameters. Since we parse the string into a `MergeOptions` struct in three different places, this PR defines the helper function `_parse_field_merge_options`.
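
A rough Python sketch of how the new parameters might be used (the factory construction and argument values below are assumptions; only the `promote_options` and `fragments` parameter names come from this change):

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical Parquet files whose schemas differ slightly between fragments.
factory = ds.FileSystemDatasetFactory(
    fs.LocalFileSystem(),
    ["data/part-0.parquet", "data/part-1.parquet"],  # assumed paths
    ds.ParquetFileFormat(),
)

# Inspect more than just the first fragment when unifying schemas;
# "permissive" mirrors unify_schemas' promote_options and is assumed here,
# as is fragments=None meaning "inspect all fragments".
schema = factory.inspect(promote_options="permissive", fragments=None)
print(schema)
```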

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This adds optional parameters to a public method. It changes the default behavior from checking one fragment to checking all fragments (the old documentation said it inspected "all data fragments" even though it didn't).

* GitHub Issue: #46629

Lead-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Hadrian Reppas <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change
Cryptographic keys must be kept private. Using the new `arrow::util::SecureString` is vital for storing secrets securely.

### What changes are included in this PR?
Uses the `arrow::util::SecureString` introduced in #46626 for cryptographic keys throughout Parquet encryption.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?
APIs that hand over secrets to Arrow require the secret to be encapsulated in a `SecureString`.

**This PR includes breaking changes to public APIs.**

TODO:
- provide instructions for migration

Supersedes  #12890.

* GitHub Issue: #31603

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
thisisnic and others added 23 commits October 6, 2025 13:07
…ax (#47622)

### Rationale for this change

Don't need base pipe

### What changes are included in this PR?

Update package to use native pipe

### Are these changes tested?

Sure

### Are there any user-facing changes?

Nah
* GitHub Issue: #47106

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Bryce Mecum <[email protected]>
… error (#47660)

### Rationale for this change

Fixes issue at #47659

### What changes are included in this PR?

Add gmock as a private link library to the `arrow_flight_testing` shared library.

### Are these changes tested?

Build for `arrow_flight_testing` succeeds on my Windows environment 

### Are there any user-facing changes?
No
* GitHub Issue: #47659

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…47454)

### Rationale for this change
Currently, the `maps_as_pydicts` parameter to `MapScalar.as_py` does not work on nested maps. See below:

```
import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)
print(s.as_py(maps_as_pydicts="strict"))

# {'x': {'a': [('1', 1)]}}
```

In this ^ case, I'd want to get the value: `{'x': {'a': {'1': 1}}}`, such that round trips would work as expected.
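
With the fix, the inner map is converted as well, so the round trip described above should hold; a small sketch of that expectation (illustrative, not a test from this PR):

```python
import pyarrow as pa

t = pa.struct([pa.field("x", pa.map_(pa.string(), pa.map_(pa.string(), pa.int8())))])
v = {"x": {"a": {"1": 1}}}
s = pa.scalar(v, type=t)

# After this change, inner maps come back as dicts too, so converting to
# Python and back should reproduce the original scalar.
roundtripped = s.as_py(maps_as_pydicts="strict")
assert roundtripped == v
assert pa.scalar(roundtripped, type=t).equals(s)
```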

### What changes are included in this PR?
Apply `maps_as_pydicts` to nested values in map types as well, and update the relevant test.
 
### Are these changes tested?
Yes

### Are there any user-facing changes?
Yes, just a user-facing fix.

* GitHub Issue: #47380

Lead-authored-by: Johanna <[email protected]>
Co-authored-by: zzkv <[email protected]>
Co-authored-by: Johanna <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…repository (#47600)

### Rationale for this change

There are several things that make this change desirable. We want to move some CI jobs from `ursacomputing/crossbow` to `apache/arrow`. Moving the Linux packaging jobs will allow us to automate some release tasks and potentially (if we are able to make reproducible builds for Linux packaging work) add automated signing to them, avoiding the need to require a PMC signature for the Linux packaging artifacts.

### What changes are included in this PR?

- Move `check_labels` and `report_ci` jobs to independent reusable workflows.
- Update `cpp_extra` to use those.
- Create new `linux_packaging.yml` workflow replicating work that was done on crossbow. Integrate that workflow with `check_labels` and `report_ci`
- Update release binary submit and binary download to run workflow when tag is pushed and download the artifacts from the release instead of from the crossbow repository.

### Are these changes tested?

Some via CI on fork and some manual testing.

### Are there any user-facing changes?

No

* GitHub Issue: #47582

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…e.g. "+04:30") (#12865)

ARROW-14477: #30036

Currently, timestamp arrays have the type `timestamp(unit, zone name)`. This would add "offset timezones", so that timestamp arrays also support types like `timestamp(unit, "+/-HH:MM")`.
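
From Python, for example, a fixed UTC offset can then be used where a zone name is accepted (a minimal pyarrow sketch; the values are illustrative, and the kernel support itself lives in the C++ layer):

```python
import pyarrow as pa

# A timestamp type carrying a fixed UTC offset instead of an IANA zone name.
ty = pa.timestamp("s", tz="+04:30")
arr = pa.array([0, 3600], type=ty)

print(ty)   # timestamp[s, tz=+04:30]
print(arr)  # values are stored as UTC; the offset applies on display/conversion
```
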
* GitHub Issue: #30036

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Rok <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
…rted image (#47730)

### Rationale for this change

Old image fails due to a Debian update

### What changes are included in this PR?

Use newer image

### Are these changes tested?

Will submit crossbow run

### Are there any user-facing changes?

No
* GitHub Issue: #47705

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

#45964 changed paths of pre-built Apache Arrow C++ binaries for R. But we forgot to update the nightly upload job.

### What changes are included in this PR?

Update paths in the nightly upload job.

### Are these changes tested?

No...

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47704

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…47743)

### Rationale for this change

Valgrind would report memory leaks induced by protobuf initialization on library load, for example:
```
==14628== 414 bytes in 16 blocks are possibly lost in loss record 22 of 26
==14628==    at 0x4914EFF: operator new(unsigned long) (vg_replace_malloc.c:487)
==14628==    by 0x8D0B6CA: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.0] (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D33E62: google::protobuf::DescriptorPool::Tables::Tables() (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D340E2: google::protobuf::DescriptorPool::DescriptorPool(google::protobuf::DescriptorDatabase*, google::protobuf::DescriptorPool::ErrorCollector*) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D341A2: google::protobuf::DescriptorPool::internal_generated_pool() (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D34277: google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x8D9C56F: google::protobuf::internal::AddDescriptorsRunner::AddDescriptorsRunner(google::protobuf::internal::DescriptorTable const*) (in /opt/conda/envs/arrow/lib/libprotobuf.so.25.3.0)
==14628==    by 0x40D147D: call_init.part.0 (dl-init.c:70)
==14628==    by 0x40D1567: call_init (dl-init.c:33)
==14628==    by 0x40D1567: _dl_init (dl-init.c:117)
==14628==    by 0x40EB2C9: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
```

This was triggered by the `libprotobuf` upgrade on conda-forge from 3.21.12 to 4.25.3.

### What changes are included in this PR?

Add a Valgrind suppression for these leak reports, as there is probably not much we can do about them.

### Are these changes tested?

Yes, by existing CI test.

### Are there any user-facing changes?

No.

* GitHub Issue: #47742

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…Parquet data (#47741)

### Rationale for this change

Fix issues found by OSS-Fuzz when invalid Parquet data is fed to the Parquet reader:
* https://issues.oss-fuzz.com/issues/447262173
* https://issues.oss-fuzz.com/issues/447480433
* https://issues.oss-fuzz.com/issues/447490896
* https://issues.oss-fuzz.com/issues/447693724
* https://issues.oss-fuzz.com/issues/447693728
* https://issues.oss-fuzz.com/issues/449498800

### Are these changes tested?

Yes, using the updated fuzz regression files from apache/arrow-testing#115

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

* GitHub Issue: #47740

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change
By default, mimalloc generates LSE atomic instructions that only work on armv8.1. This causes illegal-instruction errors on armv8.0 platforms such as the Raspberry Pi 4. This PR sets the mimalloc build flag -DMI_NO_OPT_ARCH=ON to disable LSE instructions.
Please note that even with the flag set, the compiler and libc will replace the atomic calls with an ifunc that best matches the hardware at runtime. That means LSE is used only if the running platform supports it.

### What changes are included in this PR?
Force mimalloc build flag -DMI_NO_OPT_ARCH=ON.

### Are these changes tested?
Manually tested.

### Are there any user-facing changes?
No.

**This PR contains a "Critical Fix".**
Fixes crashes on Armv8.0 platform.
* GitHub Issue: #47229

Lead-authored-by: Yibo Cai <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

According to microsoft/mimalloc#1073, mimalloc v3 is preferred over v2 for production usage.

There are reports of higher than expected memory consumption with mimalloc 2.2.x, notably when reading Parquet data (example: GH-47266).

### What changes are included in this PR?

Bump to mimalloc 3.1.5, which is the latest mimalloc 3.1.x release as of this writing.

### Are these changes tested?

Yes, by existing tests and CI.

### Are there any user-facing changes?

Hopefully not, besides a potential reduction in memory usage due to improvements in mimalloc v3.

* GitHub Issue: #47588

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

There are link errors with build options for JNI on macOS.

### What changes are included in this PR?

`ARROW_BUNDLED_STATIC_LIBS` has CMake target names defined in Apache Arrow, not `find_package()`-ed target names. So we should use `aws-c-common`, not `AWS::aws-c-common`.

Recent aws-c-common (or a related dependency) uses the Network framework, so add `Network` to the dependencies of `Arrow::arrow_bundled_dependencies`.

Don't use `compute/kernels/temporal_internal.cc` in both `libarrow.dylib` and `libarrow_compute.dylib`, to avoid a duplicate-symbols error.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #47748

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

This prevents breaking the Apache Arrow Java JNI use case on Linux.

### What changes are included in this PR?

* Add a CI job that uses build options for JNI use case
* Install more packages in manylinux image that is also used by JNI build 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47632

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

`archery docker push` doesn't support custom Docker registry such as ghcr.io.

### What changes are included in this PR?

Parse the Docker image tag and pass the Docker registry name to `docker push` if one is specified in the tag.

Docker image tag format: `[HOST[:PORT]/]NAMESPACE/REPOSITORY[:TAG]`

See also: https://docs.docker.com/reference/cli/docker/image/tag/#description
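
A rough sketch of the tag-parsing idea (not the actual archery implementation; the helper name and heuristic are hypothetical):

```python
def split_registry(tag: str) -> tuple[str | None, str]:
    """Split an optional [HOST[:PORT]/] prefix off a Docker image tag.

    The first path component is treated as a registry host only if it looks
    like one (contains '.' or ':', or is 'localhost'), mirroring how Docker
    distinguishes registries from namespaces.
    """
    first, sep, rest = tag.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return None, tag


# split_registry("ghcr.io/apache/arrow-dev:amd64") -> ("ghcr.io", "apache/arrow-dev:amd64")
# split_registry("apache/arrow-dev:amd64")         -> (None, "apache/arrow-dev:amd64")
```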

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #47795

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…47616)

### Rationale for this change

Python 3.14 is currently in a prerelease status and is expected to have a final release in October this year (https://peps.python.org/pep-0745/).

We should ensure we are fully ready to support Python 3.14 for the PyArrow 22 release.

### What changes are included in this PR?

This PR  updates wheels for Python 3.14.

### Are these changes tested?

Tested in the CI and with extended builds.

### Are there any user-facing changes?

No, but users will be able to use PyArrow with Python 3.14.

* GitHub Issue: #47438

---

Todo:

- Update the image revision name in `.env`
- Add 3.14 conda build ([arrow/dev/tasks/tasks.yml](https://github.com/apache/arrow/blob/d803afcc43f5d132506318fd9e162d33b2c3d4cd/dev/tasks/tasks.yml#L809)) when conda-forge/pyarrow-feedstock#156 is merged 

Follow-ups:

- #47437

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
…47804)

### Rationale for this change

Found by OSS-Fuzz, should fix https://issues.oss-fuzz.com/issues/451150486.

### What changes are included in this PR?

Ensure the RLE run is within bounds before reading it.

### Are these changes tested?

Yes, by a fuzz regression test in the ASAN/UBSAN build.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

* GitHub Issue: #47803

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

Summarise changes for release

### What changes are included in this PR?

Update NEWS file

### Are these changes tested?

No

### Are there any user-facing changes?

No
* GitHub Issue: #47738

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…l patch from conda (#47810)

### Rationale for this change

Our verify-rc-source Windows job is failing due to patch not being available for Windows.

### What changes are included in this PR?

Move patch requirement from `conda_env_cpp.txt` to `conda_env_unix.txt`

### Are these changes tested?

Yes via CI and archery.

### Are there any user-facing changes?

No

* GitHub Issue: #47809

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
… release branch push (#47826)

### Rationale for this change

We require the Linux package jobs to be triggered on RC tag creation. For example, for 22.0.0 we currently push the tag `apache-arrow-22.0.0-rc0` and the release branch `release-22.0.0-rc0`. Those events trigger builds over the same commit, and the tag event gets cancelled due to a "high priority task" triggering the same jobs. This causes jobs to fail on the branch because ARROW_VERSION is not generated. If we manually re-trigger the jobs on the tag, they are successful.

### What changes are included in this PR?

Remove the `release-*` branches from triggering the event to allow only the tag to run the jobs so they don't get cancelled.

### Are these changes tested?

No

### Are there any user-facing changes?

No

* GitHub Issue: #47819

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…ign with the variant spec (#47835)

### Rationale for this change
According to the [Variant specification](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md), the specification_version field must be set to 1 to indicate Variant encoding version 1. Currently, this field defaults to 0, which violates the specification. Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

### What changes are included in this PR?
The change includes defaulting the specification version to 1.
### Are these changes tested?
The change is covered by unit test.
### Are there any user-facing changes?
The Parquet files produced now carry the Variant logical type annotation `VARIANT(1)`.

```
Schema:
message schema {
  optional group V (VARIANT(1)) = 1 {
    required binary metadata;
    required binary value;
  }
}
```

* GitHub Issue: #47838

Lead-authored-by: Aihua <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Collaborator

@argmarco-tkd left a comment


LGTM. - thanks for this!

@sofia-tekdatum sofia-tekdatum merged commit 7c19398 into protegrity:main Nov 19, 2025
31 of 63 checks passed


Development

Successfully merging this pull request may close these issues.

Git merging > Sync protegrity/arrow main branch to latest Arrow release/stable version