GH-47030: [C++][Parquet] Add setting to limit the number of rows written per page #47090
Conversation
Please check if this is the right direction. @pitrou @mapleFU @adamreeve BTW, some existing test cases will break if I switch the default to a limit of 20,000 rows per page. I'm not sure if it is a good idea to use 20,000 as the default value and surprise downstream users.
adamreeve left a comment
This approach looks correct to me, thanks @wgtmac.
> I'm not sure if it is a good idea to use 20,000 as the default value and surprise downstream users.
A default of 100k would still change behaviour though, and most of the time result in smaller pages being written. I think it probably makes sense to use 20k to align with Java and Rust, but it could be done as a separate PR if there are a lot of test changes needed.
I don't imagine this should break any downstream code, but we'd definitely want to call it out in the release notes as something for users to be aware of.
I should also mention that #47032 touches the same part of the code. It looks like the fix from that PR can easily be ported to this new code though.
Should this PR be set to draft until it's ready?
I think this is ready for review. @pitrou
Thanks. I'm on vacation so I'm going to be a bit slow, sorry!
IIUC, this PR may slightly affect CDC. Let me know if you have any feedback. @kszucs
It depends on the values of the limits: if the size limits are bigger than the CDC limits, then the pages are cut earlier than the size limits would trigger, so in theory it shouldn't affect the CDC's effectiveness. We can also check this before merging using https://github.com/huggingface/dataset-dedupe-estimator
@@ -155,6 +155,7 @@ class PARQUET_EXPORT ReaderProperties {
ReaderProperties PARQUET_EXPORT default_reader_properties();

static constexpr int64_t kDefaultDataPageSize = 1024 * 1024;
static constexpr int64_t kDefaultMaxRowsPerPage = 20'000;
Maybe we should make this feature opt-in? Otherwise, should we choose a bigger value so that the data page size limit is triggered first?
Given the Parquet type sizes, we could end up with data pages much smaller (even before encoding) than 1MB, which could be unexpected to users and also increase the overall metadata size (see the rough estimate after this list):
- BOOLEAN: 1 bit boolean
- INT32: 32 bit signed ints
- INT64: 64 bit signed ints
- INT96: 96 bit signed ints (deprecated; only used by legacy implementations)
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
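For a rough sense of scale, a back-of-the-envelope sketch (an editorial illustration, not from the discussion: plain-encoded values only, ignoring rep/def levels, page headers, encoding, and compression) shows how far below the 1MB default a 20,000-row page of fixed-width values lands:

```cpp
#include <cstdint>
#include <iostream>

// Back-of-the-envelope page payload for 20,000 plain-encoded values,
// ignoring rep/def levels, page headers, encoding and compression.
int main() {
  constexpr int64_t kRowsPerPage = 20'000;
  std::cout << "BOOLEAN:      ~" << kRowsPerPage / 8 << " bytes\n";   // 1 bit per value
  std::cout << "INT32/FLOAT:  ~" << kRowsPerPage * 4 << " bytes\n";   // ~80 KB
  std::cout << "INT64/DOUBLE: ~" << kRowsPerPage * 8 << " bytes\n";   // ~160 KB
  std::cout << "INT96:        ~" << kRowsPerPage * 12 << " bytes\n";  // ~240 KB
  return 0;
}
```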
Right, defaulting to 20,000 will definitely create smaller pages for numeric types. 20,000 is also used by parquet-java and arrow-rs and is considered a good value per the discussion at https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp
The goal is precisely to make the average data page size much smaller than 1MB, which is considered too large as a compression/encoding unit. 1MB is an additional limit in case individual values are large.
@pitrou Do you have any comment on the code change?
I see. Theoretically this shouldn't affect CDC effectiveness; on the contrary, having smaller pages will likely improve the deduplication ratio. The default CDC options were chosen to approach the 1MB page size limit, though, so I need to reconsider the defaults.
Either way, I'm checking whether this change interferes with CDC or not; theoretically it shouldn't.
@pitrou Gentle ping :)
I've split
Thanks a lot @wgtmac, I find it much easier to grok now. I've posted one suggestion, but otherwise LGTM.
cpp/src/parquet/column_writer.cc
Outdated
ARROW_DCHECK_LE(offset, end_offset);
ARROW_DCHECK_LE(check_page_limit_end_offset, end_offset);

if (end_offset < num_levels) {
I presume this nesting of ifs could perhaps be rewritten like this. I don't know if it's more readable, but it is shorter and simpler.
```cpp
if (check_page_limit_end_offset >= 0) {
  // At least one record boundary is included in this batch.
  // It is a good chance to check the page limit.
  action(offset, check_page_limit_end_offset - offset, /*check_page_limit=*/true);
  offset = check_page_limit_end_offset;
}
if (end_offset > offset) {
  // This is the last chunk of the batch, and we do not know whether end_offset is a
  // record boundary, so we cannot check the page limit if pages are only allowed to
  // change on record boundaries.
  ARROW_DCHECK_EQ(end_offset, num_levels);
  action(offset, end_offset - offset,
         /*check_page_limit=*/!pages_change_on_record_boundaries);
}
```
Yes, I agree that this is a nice improvement. Thanks!
pitrou left a comment
+1, excellent. Thank you @wgtmac!
mapleFU left a comment
I just don't understand whether this would make writing columns with rep-levels slower.
// Iterate rep_levels to find the shortest sequence that ends before a record
// boundary (i.e. rep_levels == 0) with a size no less than max_batch_size
for (int64_t i = offset; i < num_levels; ++i) {
Why not do a backward scan like the previous algorithm that acquired last_record_begin_offset, to get page_buffered_rows?
Then all levels must be checked anyway; otherwise we can't tell how many records are in this batch from the beginning.
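For illustration only (a sketch of the idea with a hypothetical name DelimitBatch, not the exact code in this PR), a single forward pass can find the cut point and count the buffered records at the same time:

```cpp
#include <cstdint>

// Sketch: scan rep_levels forward from `offset`. Every rep_level == 0 starts a
// new record; stop at the first record boundary at or beyond max_batch_size so
// the batch ends on a record boundary, and report how many records it holds.
int64_t DelimitBatch(const int16_t* rep_levels, int64_t offset, int64_t num_levels,
                     int64_t max_batch_size, int64_t* num_records) {
  *num_records = 0;
  for (int64_t i = offset; i < num_levels; ++i) {
    if (rep_levels[i] == 0) {
      if (i - offset >= max_batch_size) {
        return i;  // cut right before the record starting at i
      }
      ++(*num_records);  // a record starts inside this batch
    }
  }
  return num_levels;  // no suitable boundary found; take the whole remainder
}
```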
Two questions:
- Would the existing benchmarks show whether an extra std::count pass is faster?
- If it's slower and the remaining count is greater than batch_size, can we avoid the check?
I don't think it's worth arguing about this. I doubt that a simple loop on levels will be slower than encoding them using RLE-bit-packed encoding, or encoding the values, or compressing them.
(and why would std::count be faster?)
> Would the existing benchmarks show whether an extra std::count pass is faster?
I don't quite understand this. Did you mean to use std::count to do a quick pass? I think it is already O(N) and requires a second pass to delimit records.
> If it's slower and the remaining count is greater than batch_size, can we avoid the check?
We still need to check record boundaries, at least in the reverse direction as before.
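For reference, the std::count idea under discussion would look roughly like the sketch below (a hypothetical helper, not code from this PR): it is a single pass over the same data and yields only a record count, not a cut offset, so the levels still have to be walked to delimit the page.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the std::count alternative: count record boundaries (rep_level == 0)
// in [offset, num_levels). This gives the number of records but not where to
// cut the page, so a second pass over rep_levels would still be required.
int64_t CountRecords(const int16_t* rep_levels, int64_t offset, int64_t num_levels) {
  return std::count(rep_levels + offset, rep_levels + num_levels, int16_t{0});
}
```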
I don't know whether the CPU would do the counting job faster (since it's simpler for the CPU, with no branching and easy to vectorize); it depends on benchmarking.
Theoretically it might be the case. However, I'm on vacation so I don't have enough time to benchmark it for a concrete number.
I've run the Parquet writing benchmarks locally and I didn't see any regression when writing REPEATED columns.
Ah, so this is OK with me now.
This is a welcome improvement, thanks a lot @wgtmac!
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit b2190db. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
I've created a follow-up issue to expose this in the Python bindings here:
Rationale for this change
Currently only the page size is limited. We need to limit the number of rows per page too.
What changes are included in this PR?
Add parquet::WriterProperties::max_rows_per_page(int64_t max_rows) to limit the number of rows per data page.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, users are allowed to set this config value.
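For reference, a minimal usage sketch of the new option (assuming the setter is exposed on parquet::WriterProperties::Builder alongside the existing page-size options; the exact builder method name is an assumption based on the description above):

```cpp
#include <memory>
#include <parquet/properties.h>

std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  parquet::WriterProperties::Builder builder;
  // Assumed builder method mirroring WriterProperties::max_rows_per_page():
  // caps each data page at 20,000 rows in addition to the size-based limit.
  builder.max_rows_per_page(20'000);
  builder.data_pagesize(1024 * 1024);  // existing size limit still applies
  return builder.build();
}
```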