Conversation

@wgtmac
Member

@wgtmac wgtmac commented Jul 12, 2025

Rationale for this change

Currently, only the page size is limited. We need to limit the number of rows per page too.

What changes are included in this PR?

Add parquet::WriterProperties::max_rows_per_page(int64_t max_rows) to limit the number of rows per data page.
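
For illustration, configuring the new limit would look roughly like the sketch below (assuming the setter sits on parquet::WriterProperties::Builder next to the existing page options; illustrative, not code from this PR):

    #include <memory>

    #include "parquet/properties.h"

    // Minimal sketch: cap data pages both by size in bytes and by row count.
    std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
      parquet::WriterProperties::Builder builder;
      builder.data_pagesize(1024 * 1024)  // existing per-page limit in bytes
          ->max_rows_per_page(20'000);    // new per-page limit in rows
      return builder.build();
    }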

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, users can now set this config value.

@github-actions

⚠️ GitHub issue #40730 has been automatically assigned in GitHub to PR creator.

@wgtmac
Member Author

wgtmac commented Jul 12, 2025

Please check if this is the right direction. @pitrou @mapleFU @adamreeve

BTW, some existing test cases will break if I switch the default value to limit 20,000 rows per page. I'm not sure whether using 20,000 as the default value is a good idea, since it may surprise downstream users.

@wgtmac wgtmac changed the title GH-40730: [C++][Parquet] Add setting to limit the number of rows written per page GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page Jul 12, 2025
@github-actions

⚠️ GitHub issue #47030 has been automatically assigned in GitHub to PR creator.

Contributor

@adamreeve adamreeve left a comment

This approach looks correct to me, thanks @wgtmac.

I'm not sure whether using 20,000 as the default value is a good idea, since it may surprise downstream users.

A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.

I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 14, 2025
@adamreeve
Contributor

I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though.

@mapleFU mapleFU self-requested a review July 15, 2025 07:42
@pitrou
Member

pitrou commented Jul 15, 2025

Should this PR be set to draft until it's ready?

@wgtmac
Member Author

wgtmac commented Jul 19, 2025

I think this is ready for review. @pitrou

@pitrou
Member

pitrou commented Jul 20, 2025

Thanks. I'm on vacation so I'm going to be a bit slow, sorry!

@wgtmac
Member Author

wgtmac commented Jul 25, 2025

IIUC, this PR may slightly affect CDC. Let me know if you have any feedback. @kszucs

@kszucs
Member

kszucs commented Jul 27, 2025

It depends on the value of the length limit. The current size limit is 1MB, calculated after encoding, while the CDC default size range, calculated on the logical values before encoding, is between 256KB and 1MB. CDC chunking is applied before the size limit check, so using the default parameters should trigger a data page write before the size limit check. If the size limit is set to a smaller value, then there will be two data pages: a larger one cut at the size limit, and a smaller one cut at the CDC boundary, because the CDC hash is not reset when the size limit is triggered. So basically there are two cases:

a) the size limits are bigger than the CDC limits, so the pages are cut earlier than the size limits would trigger:
page1 [cdc-cut] page2 [cdc-cut] page3
b) the size limits are smaller than the CDC limits, so the previous CDC cuts will happen nonetheless, but the pages are split further according to the size limits:
page1 [cdc-cut] page2/a [size-cut] page2/b [cdc-cut] page3

So in theory it shouldn't affect the CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator

@@ -155,6 +155,7 @@ class PARQUET_EXPORT ReaderProperties {
ReaderProperties PARQUET_EXPORT default_reader_properties();

static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
static constexpr int64_t kDefaultMaxRowsPerPage = 20'000;
Member

Maybe we should make this feature opt-in? Otherwise, should we choose a bigger value so that the data page size limit is triggered first?

Given the Parquet type sizes, we could end up with data pages much smaller than 1MB (even before encoding), which could be unexpected to users and also increase the overall metadata size (see the rough numbers after the list below):

  - BOOLEAN: 1 bit boolean
  - INT32: 32 bit signed ints
  - INT64: 64 bit signed ints
  - INT96: 96 bit signed ints (deprecated; only used by legacy implementations)
  - FLOAT: IEEE 32-bit floating point values
  - DOUBLE: IEEE 64-bit floating point values
  - BYTE_ARRAY: arbitrarily long byte arrays
  - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
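
For a rough sense of scale (back-of-the-envelope numbers, not from the PR): with a 20,000-row cap, a plain INT32 page would hold about 20,000 × 4 B ≈ 78 KiB of raw values, a DOUBLE page about 156 KiB, and a BOOLEAN page only about 2.4 KiB, all well below the 1MB byte limit even before encoding.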

Member Author

Right, defaulting to 20000 will definitely create smaller pages for numeric types. 20000 is also used by parquet-java and arrow-rs, and is considered a good value per the discussion at https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp

Member

The goal is precisely to make the average data page size much smaller than 1MB, which is considered too large as a compression/encoding unit. 1MB is an additional limit in case individual values are large.

Member Author

@pitrou Do you have any comment on the code change?

Member

I see. Theoretically this shouldn't affect CDC effectiveness; on the contrary, having smaller pages will likely improve the deduplication ratio. However, the default CDC options were chosen to approach the 1MB page size limit, so I need to reconsider the defaults.

Either way, I'm checking whether this change interferes with CDC or not; theoretically it shouldn't.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 27, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 27, 2025
@wgtmac wgtmac requested a review from pitrou August 27, 2025 15:45
@kszucs kszucs self-requested a review September 3, 2025 12:44
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 3, 2025
@wgtmac
Member Author

wgtmac commented Sep 18, 2025

@pitrou Gentle ping :)

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 12, 2025
@wgtmac
Member Author

wgtmac commented Oct 13, 2025

I've split DoInBatches into repeated and non-repeated ones. Let me know what you think, thanks! @pitrou
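
Roughly, the shape of the split looks like the sketch below (illustrative names and signatures only, not the actual code; the real implementation also honours whether pages must change on record boundaries):

    #include <algorithm>
    #include <cstdint>
    #include <functional>

    using BatchAction = std::function<void(int64_t offset, int64_t batch_size,
                                           bool check_page_limit)>;

    // Non-repeated columns: every level starts a new record, so every chunk
    // ends on a record boundary and the page limits can always be checked.
    void DoInBatchesNonRepeated(int64_t num_levels, int64_t max_batch_size,
                                const BatchAction& action) {
      for (int64_t offset = 0; offset < num_levels;) {
        const int64_t batch_size = std::min(max_batch_size, num_levels - offset);
        action(offset, batch_size, /*check_page_limit=*/true);
        offset += batch_size;
      }
    }

    // Repeated columns: only rep_level == 0 starts a new record, so a chunk is
    // extended to the next record boundary before the page limits may be checked.
    void DoInBatchesRepeated(const int16_t* rep_levels, int64_t num_levels,
                             int64_t max_batch_size, const BatchAction& action) {
      for (int64_t offset = 0; offset < num_levels;) {
        int64_t end = std::min(offset + max_batch_size, num_levels);
        while (end < num_levels && rep_levels[end] != 0) ++end;  // finish the record
        // If end == num_levels we may still be mid-record, so skip the limit check.
        action(offset, end - offset, /*check_page_limit=*/end < num_levels);
        offset = end;
      }
    }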

Member

@pitrou pitrou left a comment

Thanks a lot @wgtmac , I find it much easier to grok now. I've posted one suggestion, but otherwise LGTM.

ARROW_DCHECK_LE(offset, end_offset);
ARROW_DCHECK_LE(check_page_limit_end_offset, end_offset);

if (end_offset < num_levels) {
Member

I presume this nesting of ifs could perhaps be rewritten like this. I don't know if it's more readable, but it is shorter and simpler.

    if (check_page_limit_end_offset >= 0) {
      // At least one record boundary is included in this batch.
      // It is a good chance to check the page limit.
      action(offset, check_page_limit_end_offset - offset, /*check_page_limit=*/true);
      offset = check_page_limit_end_offset;
    }
    if (end_offset > offset) {
      // This is the last chunk of the batch, and we do not know whether end_offset
      // is a record boundary, so we cannot check the page limit if pages must
      // change on record boundaries.
      ARROW_DCHECK_EQ(end_offset, num_levels);
      action(offset, end_offset - offset,
             /*check_page_limit=*/!pages_change_on_record_boundaries);
    }

Member Author

Yes, I agree that this is a nice improvement. Thanks!

@wgtmac wgtmac requested a review from pitrou October 24, 2025 02:05
Member

@pitrou pitrou left a comment

+1, excellent. Thank you @wgtmac !

Member

@mapleFU mapleFU left a comment

I just don't understand whether this would make writing columns with rep-levels slower.

Comment on lines +1194 to +1196
// Iterate rep_levels to find the shortest sequence that ends before a record
// boundary (i.e. rep_levels == 0) with a size no less than max_batch_size
for (int64_t i = offset; i < num_levels; ++i) {
Member

Why not do a backward scan, like the previous algorithm that acquires last_record_begin_offset, to get page_buffered_rows?

Member Author

Then all levels must be checked; otherwise we can't tell how many records are in this batch from the beginning.

Member

Two questions:

  1. Would the existing benchmarks show whether an extracted std::count pass is faster?
  2. If it's slower and the remaining count is greater than batch_size, can we avoid checking?

Member

I don't think it's worth arguing about this. I doubt that a simple loop on levels will be slower than encoding them using RLE-bit-packed encoding, or encoding the values, or compressing them.

(and why would std::count be faster?)

Member Author

Would the existing benchmarks show whether an extracted std::count pass is faster?

I don't quite understand this. Did you mean to use std::count as a quick first pass? That is already O(N) and still requires a second pass to delimit records.

If it's slower and the remaining count is greater than batch_size, can we avoid checking?

We still need to check record boundaries, at least in the reverse direction as before.
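
For reference, the kind of separate counting pass being discussed would look roughly like this (illustrative only); it counts records cheaply but still leaves a second scan to find where to cut each page:

    #include <algorithm>
    #include <cstdint>

    // Count record boundaries (rep_level == 0) in a run of levels; easy to
    // vectorize, but it does not tell us at which offset to cut the page.
    int64_t CountRecords(const int16_t* rep_levels, int64_t num_levels) {
      return std::count(rep_levels, rep_levels + num_levels, static_cast<int16_t>(0));
    }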

Member

@mapleFU mapleFU Oct 29, 2025

I don't know whether the CPU would do the counting job faster (since it's simpler for the CPU, with no branching and easy to vectorize); it depends on benchmarking.

@wgtmac
Member Author

wgtmac commented Oct 29, 2025

I just don't understand whether this would make writing columns with rep-levels slower.

Theoretically it might be the case. However, I'm on vacation, so I don't have enough time to benchmark it for a concrete number.

@pitrou
Member

pitrou commented Oct 29, 2025

I've run the Parquet writing benchmarks locally and didn't see any regression when writing REPEATED columns.

@mapleFU
Member

mapleFU commented Oct 29, 2025

I've run the Parquet writing benchmarks locally and didn't see any regression when writing REPEATED columns.

Ah, so this is OK with me now.

@pitrou pitrou merged commit b2190db into apache:main Oct 29, 2025
48 of 49 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Oct 29, 2025
@pitrou
Member

pitrou commented Oct 29, 2025

This is a welcome improvement, thanks a lot @wgtmac !

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit b2190db.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@raulcd
Member

raulcd commented Nov 11, 2025

I've created a follow-up issue to expose this on the Python bindings here:
