-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6216: [C++] Expose codec compression level to user #5071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for the patch. Please make sure to use "make format" to fix the lint errors. |
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou do you have thoughts on this?
|
It would be nice to also expose compression level for other codecs - gzip, brotli, lz4 |
|
Perhaps this could be added to the |
|
Some notes regarding the design decision in the patch:
|
|
Alright, I see that there are build failures for clients of Arrow. I can go and try to change my patch to avoid breaking everything. |
39f66d4 to
e6c964b
Compare
|
@hatemhelal @emkornfield @Ingvar-Y @pitrou |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. There still seems to be a bit of undue complication here.
Also, perhaps people who care about Parquet should review the corresponding changes :-)
cpp/src/arrow/io/compressed-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keep using the 2-argument form?
cpp/src/arrow/util/compression.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem to be any point in exposing this function in a header file, is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I moved it to the source file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it was fine to leave this default here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to not do that and keep all of the levels in one place.
Also, some codec source files are not always compiled into the project and thus there would have been an extra ifdef necessary just because of that in the function for retrieving the default compression level for the given compression type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the constant in compression_brotli.h and then use that constant in the implementation of the get-default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of all this? It seems this is gratuitously complicating the code, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point was to avoid having to write GetCompressionCodecDefaultCompressionLevel(...) in many places.
You might argue that than we should set a default value for the compression level parameter. However, I prefer it to have to be explicitly set within Arrow/Parquet while making it optional for the user. The way I see it this avoids potential improper calls to functions which have multiple default parameters, AND avoids having many functions be duplicated whenever it's not possible to have a default value.
Anyway, I will remove these constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that the default values are known in Arrow somewhere and there's no point in duplicating them in Parquet. Let Arrow fill in the default if no specific value is passed.
cpp/src/parquet/types.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? Why not use the 2-argument form of Codec::Create if the compression level isn't specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PageWriter::Open I would like to have to the compression level already selected since there are two different paths selected:: BufferedPageWriter and SerializedPageWriter.
I do not want to introduce any default parameters and leave deduction of the compression level at a level close to the arrow::util::Codec::Create because this means there will be two similar paths maintained - one with the compression level selected and the one with no compression level.
Selecting the compression level as close to the parquet api as possible means that there will be for the most part a single path maintained. However, there is a complication that arrow maintains its own compression enum with an extra bz2 entry. Thus, there's a need for some translation like it was already done and a bit simplified in the newest iteration of the change.
I still followed your advice to some extent which also made the change slightly smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense for each codec to "know" its own default compression level rather than localizing/externalizing this information into a function that deals with all the codecs. From the parquet APIs, I think you could use placeholder value like -1 to signify using the default compression level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, -1 cannot be a placeholder because it can be a valid compression level for some codec which can be different from the default in Arrow. That's why it's best to query the default compression level close to the API. We cannot have a magic value which signals to use the default value which is some other magic value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why it's best to query the default compression level close to the API.
That doesn't follow. You could e.g. have both ZSTDCodec() and ZSTDCodec(int compression_level) constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility is ZSTDCodec(int compression_level = kDefaultCompressionLevel) where kDefaultCompressionLevel is chosen such that it is unlikely to be confused with an actual compression level (for example INT_MIN).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ZSTDCodec(int compression_level = kDefaultCompressionLevel) this is not any different from your early proposal of having two constructors. On higher levels in the stack it would still have to be determined whether to pass an argument or not.
I do not want to propagate some random value through the whole stack with the hope that it doesn't conflict with what the user might specify.
So, let's see. 20 users decide that they want to pass INT_MIN as a compression level. Sure, it doesn't make sense but it works! ZSTD would be set up to use a compression level of 1 since INT_MIN is this magic value which is a hint that arrow is free to select whatever it wants.
Then, arrow changes the implementation a bit so that instead of INT_MIN we pick the much nicer 0x31337 magic value.
All user code suddenly starts behaving in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On higher levels in the stack it would still have to be determined whether to pass an argument or not.
No, you could simply pass kDefaultCompressionLevel.
So, let's see. 20 users decide that they want to pass INT_MIN as a compression level. Sure, it doesn't make sense but it works!
I am not very interested in discussing contrived scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you could simply pass kDefaultCompressionLevel.
Yes, but kDefaultCompressionLevel will be that magic value you suggest introducing.
I am not very interested in discussing contrived scenarios.
It is a hack to have an internal representation which can conflict with what the user provides. Especially, if it can easily be avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, -1 cannot be a placeholder because it can be a valid compression level for some codec which can be different from the default in Arrow.
Ok, thanks for correcting me. Another suggestion would be to add a method on ColumnProperties that can tell you whether to use the codec defaults or override it with a user-supplied value. Something like:
bool using_compression_level
that only returns true when a user has manually set the compression level. I've seen problems like this solved with boost::optional but feels like overkill for this case, but as you've seen I'm often wrong so maybe check that out too.
|
Thanks for the reviews. |
21cc6f6 to
898362c
Compare
Codecov Report
@@ Coverage Diff @@
## master #5071 +/- ##
==========================================
- Coverage 87.63% 75.3% -12.34%
==========================================
Files 1022 57 -965
Lines 146258 3620 -142638
Branches 1437 0 -1437
==========================================
- Hits 128177 2726 -125451
+ Misses 17719 894 -16825
+ Partials 362 0 -362Continue to review full report at Codecov.
|
|
@emkornfield are you comfortable with the changes? |
31cfbe7 to
77221eb
Compare
hatemhelal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martinradev this is looking much improved from when I last reviewed. I do think this can be further simplified by making each codec responsible for managing its own default compression level.
cpp/src/parquet/types.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense for each codec to "know" its own default compression level rather than localizing/externalizing this information into a function that deals with all the codecs. From the parquet APIs, I think you could use placeholder value like -1 to signify using the default compression level.
cpp/src/parquet/types.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to switch to int32_t? It looks like int was previously used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I used int32_t from the start for everywhere I propagated the compression level. I also used int32_t since it looks like that's the preferred type in Arrow and other programming guidelines.
I know that the compression libraries used in Arrow have the compression level as 'int'. I don't think this is a problem for any system using Arrow but I could replace all int32_t with int if you expect int and int32_t to not have the same size.
|
@wesm can you please review this patch? |
|
My drive by comment is I tend to agree with @pitrou and @hatemhelal but understand being careful about collisions. A few suggestions which may or may not be a good idea:
Also it would be nice to keep existing APIs in place if at all possible and possibly deprecate them instead of replacing them. |
|
If we want to gold-plate this, we should probably make the additional argument a But I still think a compression level with a default value like And using |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! I think there's some things we can do to make things simpler. I left some stylistic comments also.
cpp/src/arrow/util/compression.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making this a static Codec::GetDefaultCompressionLevel. Little tidier that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it doesn't because then this would add a dependency on the implementation of the codec. The implementation would only be visible if the library is compiled in and this would introduce a lot of ifdefs (ifdef ARROW_WITH_ZLIB, etc).
However, I can change it as you requested. I'm probably not getting this patch in if I would be stubborn for everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to follow what @pitrou recommended by using numeric_limits::min as a hint to let the implementation select whatever compression level it deems appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm making this a static method... I will push changes to your branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the constant in compression_brotli.h and then use that constant in the implementation of the get-default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undid my change with regards to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to return an error if you specify a compression level for a compressor that does not support it? Either way we need tests asserting what the behavior is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to mention in the api for parquet-cpp that parquet/arrow do not do validation on the passed compression level.
- it could be some effort to track what is the range of possible compression levels for a version of a compressor. Thus, it's best if the user finds out what's meaningful. Worst case initializing the codec would fail.
- With that I also decided that I would not add validation for whether a compressor supports specifying a compression level. It's a hint and it doesn't necessarily lead to better compression ratio or speed.
I hope you're fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undid my change with regards to this.
cpp/src/parquet/file_writer.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost suggests it would be better to pass properties_ and the column path into PageWriter::Open instead. Not sure how much refactoring that would be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's only the compression type and level which are passed, and in this particular case the memory pool. However, not all three are always passed but rather the compression level and codec. Thus, I do not want to pass properties since there is a lot more unnecessary things there.
cpp/src/parquet/file_writer.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as above re: this code duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered above.
cpp/src/parquet/types.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems silly to me to continue to maintain this code. Can we do using Compression = ::arrow::Compression in parquet/types.h and move on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By this, you mean parquet::Compression to be removed and replaced with arrow::Compression?
The API would be broken then since this would also expose Compression::BZ2 which is not a valid compression algorithm for Parquet.
I do not know the history of the project and why this is so.
Furthermore, the enum values are different for arrow::Compression and parquet::Compression. I am not sure whether this could create back-compatibility issues all of a sudden.
To me it looks that some refactoring would be necessary and that it's not part of this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would prohibit unsupported compressions in Parquet.
@xhochy @majetideepak any thoughts about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this either way as long as we are to get programmatically which codecs are supported by Parquet.
|
@martinradev the PR description is no longer accurate, can you update it for the sake of the changelog? |
|
@wesm @pitrou Thanks for the reviews. I made some more changes and I think the current version of the patch is closer to what @pitrou wanted from the start. Can you give another review? |
cpp/src/parquet/properties.h
Outdated
| ColumnProperties default_column_properties_; | ||
| std::unordered_map<std::string, Encoding::type> encodings_; | ||
| std::unordered_map<std::string, Compression::type> codecs_; | ||
| std::unordered_map<std::string, int32_t> codecs_compression_level_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this should be int. I can write a follow-up patch if there are no more things to fix in this one.
If there are more things to fix, then I will fix it alongside whatever needs to be addressed.
|
@wesm Can you have a second look? |
|
Looking. In the future, can you use a feature branch? Using master makes things more difficult for project maintainers |
cpp/src/parquet/properties.h
Outdated
| * If no level is selected by the user or if the special | ||
| * std::numeric_limits<int>::min() value is passed, then Arrow selects the compression | ||
| * level. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use /// comments. I will fix
|
Bah, I got thwarted by the GitHub UI. It's difficult for me to edit your branch when it's "master" so I will open a new PR |
|
See #5192 |
…d to Parquet writer properties This patch allows the user to select a compression level to be used for when column data in Parquet is compressed. The user can select between better compression ratio or speed. Reopening PR to supersede #5071 since the GitHub UI closed the PR when I attempted to push commits to the contributor's master branch Closes #5192 from wesm/ARROW-6216-compression and squashes the following commits: 8f12461 <Wes McKinney> Remove GzipCodec ctor that's ambiguous in MSVC beafdda <Wes McKinney> clang-format 1c90586 <Wes McKinney> Move default compression levels to headers, remove nullary ctors 06d8a14 <Wes McKinney> Use arrow::Compression in parquet:: namespace, address other stylistic nits 7f3bf21 <Martin Radev> ARROW-6216: Expose codec compression level to user Lead-authored-by: Martin Radev <[email protected]> Co-authored-by: Wes McKinney <[email protected]> Signed-off-by: Wes McKinney <[email protected]>
This patch allows the user to select a compression level to be used for when column data in Parquet is compressed.
The user can select between better compression ratio or speed.