Conversation

@wgtmac
Member

@wgtmac wgtmac commented Jul 12, 2025

Rationale for this change

Currently, only the page size is limited. We need to limit the number of rows per page too.

What changes are included in this PR?

Add parquet::WriterProperties::max_rows_per_page(int64_t max_rows) to limit the number of rows per data page.
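
For illustration, configuring the new limit would look roughly like the sketch below (assuming the setter sits on parquet::WriterProperties::Builder next to the existing page options; illustrative, not code from this PR):

    #include <memory>

    #include "parquet/properties.h"

    // Minimal sketch: cap data pages both by size in bytes and by row count.
    std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
      parquet::WriterProperties::Builder builder;
      builder.data_pagesize(1024 * 1024)  // existing per-page limit in bytes
          ->max_rows_per_page(20'000);    // new per-page limit in rows
      return builder.build();
    }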

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, users can now set this config value.

@github-actions

⚠️ GitHub issue #40730 has been automatically assigned in GitHub to PR creator.

@wgtmac
Member Author

wgtmac commented Jul 12, 2025

Please check if this is the right direction. @pitrou @mapleFU @adamreeve

BTW, some existing test cases will break if I switch the default value to limit 20,000 rows per page. I'm not sure whether using 20,000 as the default value is a good idea, since it may surprise downstream users.

@wgtmac wgtmac changed the title GH-40730: [C++][Parquet] Add setting to limit the number of rows written per page GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page Jul 12, 2025
@github-actions

⚠️ GitHub issue #47030 has been automatically assigned in GitHub to PR creator.

Contributor

@adamreeve adamreeve left a comment

This approach looks correct to me, thanks @wgtmac.

I'm not sure whether using 20,000 as the default value is a good idea, since it may surprise downstream users.

A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.

I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 14, 2025
@adamreeve
Contributor

I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though.

@mapleFU mapleFU self-requested a review July 15, 2025 07:42
@pitrou
Member

pitrou commented Jul 15, 2025

Should this PR be set to draft until it's ready?

@wgtmac
Member Author

wgtmac commented Jul 19, 2025

I think this is ready for review. @pitrou

@pitrou
Member

pitrou commented Jul 20, 2025

Thanks. I'm on vacation so I'm going to be a bit slow, sorry!

@wgtmac
Member Author

wgtmac commented Jul 25, 2025

IIUC, this PR may slightly affect CDC. Let me know if you have any feedback. @kszucs

@kszucs
Member

kszucs commented Jul 27, 2025

It depends on the value of the length limit. The current size limit is 1MB, calculated after encoding, while the CDC default size range, calculated on the logical values before encoding, is between 256KB and 1MB. CDC chunking is applied before the size limit check, so using the default parameters should trigger a data page write before the size limit check. If the size limit is set to a smaller value, then there will be two data pages: a larger one cut at the size limit, and a smaller one cut at the CDC boundary, because the CDC hash is not reset when the size limit is triggered. So basically there are two cases:

a) the size limits are bigger than the CDC limits, so the pages are cut earlier than the size limits would trigger:
page1 [cdc-cut] page2 [cdc-cut] page3
b) the size limits are smaller than the CDC limits, so the previous CDC cuts will happen nonetheless, but the pages are split further according to the size limits:
page1 [cdc-cut] page2/a [size-cut] page2/b [cdc-cut] page3

So in theory it shouldn't affect the CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator

@@ -155,6 +155,7 @@ class PARQUET_EXPORT ReaderProperties {
ReaderProperties PARQUET_EXPORT default_reader_properties();

static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
static constexpr int64_t kDefaultMaxRowsPerPage = 20'000;
Member

Maybe we should make this feature opt-in? Otherwise, should we choose a bigger value so that the data page size limit is triggered first?

Given the Parquet type sizes, we could end up with data pages much smaller than 1MB (even before encoding), which could be unexpected to users and also increase the overall metadata size (see the rough numbers after the list below):

  - BOOLEAN: 1 bit boolean
  - INT32: 32 bit signed ints
  - INT64: 64 bit signed ints
  - INT96: 96 bit signed ints (deprecated; only used by legacy implementations)
  - FLOAT: IEEE 32-bit floating point values
  - DOUBLE: IEEE 64-bit floating point values
  - BYTE_ARRAY: arbitrarily long byte arrays
  - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
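
For a rough sense of scale (back-of-the-envelope numbers, not from the PR): with a 20,000-row cap, a plain INT32 page would hold about 20,000 × 4 B ≈ 78 KiB of raw values, a DOUBLE page about 156 KiB, and a BOOLEAN page only about 2.4 KiB, all well below the 1MB byte limit even before encoding.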

Member Author

Right, defaulting to 20000 will definitely create smaller pages for numeric types. 20000 is also used by parquet-java and arrow-rs, and is considered a good value per the discussion at https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp

Member

The goal is precisely to make the average data page size much smaller than 1MB, which is considered too large as a compression/encoding unit. 1MB is an additional limit in case individual values are large.

Member Author

@pitrou Do you have any comment on the code change?

Member

I see. Theoretically this shouldn't affect CDC effectiveness; on the contrary, having smaller pages will likely improve the deduplication ratio. However, the default CDC options were chosen to approach the 1MB page size limit, so I need to reconsider the defaults.

Either way, I'm checking whether this change interferes with CDC or not; theoretically it shouldn't.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 27, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 27, 2025
@wgtmac wgtmac requested a review from pitrou August 27, 2025 15:45
@kszucs kszucs self-requested a review September 3, 2025 12:44
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 3, 2025
@wgtmac
Member Author

wgtmac commented Sep 18, 2025

@pitrou Gentle ping :)

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 12, 2025
@wgtmac
Member Author

wgtmac commented Oct 13, 2025

I've split DoInBatches into repeated and non-repeated ones. Let me know what you think, thanks! @pitrou
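
Roughly, the shape of the split looks like the sketch below (illustrative names and signatures only, not the actual code; the real implementation also honours whether pages must change on record boundaries):

    #include <algorithm>
    #include <cstdint>
    #include <functional>

    using BatchAction = std::function<void(int64_t offset, int64_t batch_size,
                                           bool check_page_limit)>;

    // Non-repeated columns: every level starts a new record, so every chunk
    // ends on a record boundary and the page limits can always be checked.
    void DoInBatchesNonRepeated(int64_t num_levels, int64_t max_batch_size,
                                const BatchAction& action) {
      for (int64_t offset = 0; offset < num_levels;) {
        const int64_t batch_size = std::min(max_batch_size, num_levels - offset);
        action(offset, batch_size, /*check_page_limit=*/true);
        offset += batch_size;
      }
    }

    // Repeated columns: only rep_level == 0 starts a new record, so a chunk is
    // extended to the next record boundary before the page limits may be checked.
    void DoInBatchesRepeated(const int16_t* rep_levels, int64_t num_levels,
                             int64_t max_batch_size, const BatchAction& action) {
      for (int64_t offset = 0; offset < num_levels;) {
        int64_t end = std::min(offset + max_batch_size, num_levels);
        while (end < num_levels && rep_levels[end] != 0) ++end;  // finish the record
        // If end == num_levels we may still be mid-record, so skip the limit check.
        action(offset, end - offset, /*check_page_limit=*/end < num_levels);
        offset = end;
      }
    }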

Member

@pitrou pitrou left a comment

Thanks a lot @wgtmac , I find it much easier to grok now. I've posted one suggestion, but otherwise LGTM.

ARROW_DCHECK_LE(offset, end_offset);
ARROW_DCHECK_LE(check_page_limit_end_offset, end_offset);

if (end_offset < num_levels) {
Member

I presume this nesting of ifs could perhaps be rewritten like this. I don't know if it's more readable, but it is shorter and simpler.

    if (check_page_limit_end_offset >= 0) {
      // At least one record boundary is included in this batch.
      // It is a good chance to check the page limit.
      action(offset, check_page_limit_end_offset - offset, /*check_page_limit=*/true);
      offset = check_page_limit_end_offset;
    }
    if (end_offset > offset) {
      // This is the last chunk of the batch, and we do not know whether end_offset
      // is a record boundary, so we cannot check the page limit if pages must
      // change on record boundaries.
      ARROW_DCHECK_EQ(end_offset, num_levels);
      action(offset, end_offset - offset,
             /*check_page_limit=*/!pages_change_on_record_boundaries);
    }

Member Author

Yes, I agree that this is a nice improvement. Thanks!

@wgtmac wgtmac requested a review from pitrou October 24, 2025 02:05
Member

@pitrou pitrou left a comment

+1, excellent. Thank you @wgtmac !

Member

@mapleFU mapleFU left a comment

I just don't understand whether this would make writing columns with rep-levels slower.

Comment on lines +1194 to +1196
// Iterate rep_levels to find the shortest sequence that ends before a record
// boundary (i.e. rep_levels == 0) with a size no less than max_batch_size
for (int64_t i = offset; i < num_levels; ++i) {
Member

Why not do a backward scan, like the previous algorithm that acquires last_record_begin_offset, to get page_buffered_rows?

Member Author

Then all levels must be checked; otherwise we can't tell how many records are in this batch from the beginning.

Member

Two questions:

  1. Would the existing benchmarks show whether an extracted std::count pass is faster?
  2. If it's slower and the remaining count is greater than batch_size, can we avoid checking?

Member

I don't think it's worth arguing about this. I doubt that a simple loop on levels will be slower than encoding them using RLE-bit-packed encoding, or encoding the values, or compressing them.

(and why would std::count be faster?)

Member Author

Would the existing benchmarks show whether an extracted std::count pass is faster?

I don't quite understand this. Did you mean to use std::count as a quick first pass? That is already O(N) and still requires a second pass to delimit records.

If it's slower and the remaining count is greater than batch_size, can we avoid checking?

We still need to check record boundaries, at least in the reverse direction as before.
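
For reference, the kind of separate counting pass being discussed would look roughly like this (illustrative only); it counts records cheaply but still leaves a second scan to find where to cut each page:

    #include <algorithm>
    #include <cstdint>

    // Count record boundaries (rep_level == 0) in a run of levels; easy to
    // vectorize, but it does not tell us at which offset to cut the page.
    int64_t CountRecords(const int16_t* rep_levels, int64_t num_levels) {
      return std::count(rep_levels, rep_levels + num_levels, static_cast<int16_t>(0));
    }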

Member

@mapleFU mapleFU Oct 29, 2025

I don't know whether the CPU would do the counting job faster (since it's simpler for the CPU, with no branching and easy to vectorize); it depends on benchmarking.

@wgtmac
Member Author

wgtmac commented Oct 29, 2025

I just don't understand whether this would make writing columns with rep-levels slower.

Theoretically it might be the case. However, I'm on vacation, so I don't have enough time to benchmark it for a concrete number.

@pitrou
Member

pitrou commented Oct 29, 2025

I've run the Parquet writing benchmarks locally and didn't see any regression when writing REPEATED columns.

@mapleFU
Member

mapleFU commented Oct 29, 2025

I've run the Parquet writing benchmarks locally and didn't see any regression when writing REPEATED columns.

Ah, so this is OK with me now.

@pitrou pitrou merged commit b2190db into apache:main Oct 29, 2025
48 of 49 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Oct 29, 2025
@pitrou
Member

pitrou commented Oct 29, 2025

This is a welcome improvement, thanks a lot @wgtmac !

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit b2190db.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@raulcd
Member

raulcd commented Nov 11, 2025

I've created a follow-up issue to expose this on the Python bindings here:
