Skip to content

Conversation

@martinradev
Copy link
Contributor

@martinradev martinradev commented Aug 12, 2019

This patch allows the user to select a compression level to be used for when column data in Parquet is compressed.
The user can select between better compression ratio or speed.

@emkornfield
Copy link
Contributor

Thank you for the patch. Please make sure to use "make format" to fix the lint errors.

@emkornfield emkornfield changed the title ARROW-6216: Expose ZSTD compression level to user ARROW-6216: [C++] Expose ZSTD compression level to user Aug 13, 2019
Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou do you have thoughts on this?

@Ingvar-Y
Copy link
Contributor

It would be nice to also expose compression level for other codecs - gzip, brotli, lz4

@hatemhelal
Copy link
Contributor

Perhaps this could be added to the parquet::ColumnProperties? Your client is still free to set a default value on a project level or even override it on a file-by-file basis.

@martinradev
Copy link
Contributor Author

Some notes regarding the design decision in the patch:

  • As suggested, the compression level becomes part of ColumnProperties and has to be piped to the creation of the codec.
  • I explicitly did not add validation for what would be a reasonable compression level. That is because it would add maintenance overhead once the version of some codec is updated. As in, the range of 0, 30 makes sense for version 2.4 but for 2.5 the range is 0, 40. I am just sure that this will be missed when a compressor is updated. So, it's up to the user for them to know what to use, what compressor version is used and what compression level makes sense for his or her needs.
  • I added and updated tests to exercise the API but they are really smoke tests. It only verifies that the API is callable and no failures are observed.

@martinradev
Copy link
Contributor Author

Alright, I see that there are build failures for clients of Arrow. I can go and try to change my patch to avoid breaking everything.

@martinradev martinradev changed the title ARROW-6216: [C++] Expose ZSTD compression level to user ARROW-6216: [C++] Expose codec compression level to user Aug 15, 2019
@martinradev martinradev force-pushed the master branch 2 times, most recently from 39f66d4 to e6c964b Compare August 15, 2019 13:14
@martinradev
Copy link
Contributor Author

@hatemhelal @emkornfield @Ingvar-Y @pitrou
Thank you for the initial round of reviews. The code has significantly changed.
Can you have another look?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. There still seems to be a bit of undue complication here.
Also, perhaps people who care about Parquet should review the corresponding changes :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep using the 2-argument form?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn't seem to be any point in exposing this function in a header file, is there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I moved it to the source file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it was fine to leave this default here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to not do that and keep all of the levels in one place.
Also, some codec source files are not always compiled into the project and thus there would have been an extra ifdef necessary just because of that in the function for retrieving the default compression level for the given compression type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the constant in compression_brotli.h and then use that constant in the implementation of the get-default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of all this? It seems this is gratuitously complicating the code, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point was to avoid having to write GetCompressionCodecDefaultCompressionLevel(...) in many places.
You might argue that than we should set a default value for the compression level parameter. However, I prefer it to have to be explicitly set within Arrow/Parquet while making it optional for the user. The way I see it this avoids potential improper calls to functions which have multiple default parameters, AND avoids having many functions be duplicated whenever it's not possible to have a default value.
Anyway, I will remove these constants.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that the default values are known in Arrow somewhere and there's no point in duplicating them in Parquet. Let Arrow fill in the default if no specific value is passed.

Copy link
Member

@pitrou pitrou Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Why not use the 2-argument form of Codec::Create if the compression level isn't specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PageWriter::Open I would like to have to the compression level already selected since there are two different paths selected:: BufferedPageWriter and SerializedPageWriter.
I do not want to introduce any default parameters and leave deduction of the compression level at a level close to the arrow::util::Codec::Create because this means there will be two similar paths maintained - one with the compression level selected and the one with no compression level.
Selecting the compression level as close to the parquet api as possible means that there will be for the most part a single path maintained. However, there is a complication that arrow maintains its own compression enum with an extra bz2 entry. Thus, there's a need for some translation like it was already done and a bit simplified in the newest iteration of the change.
I still followed your advice to some extent which also made the change slightly smaller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense for each codec to "know" its own default compression level rather than localizing/externalizing this information into a function that deals with all the codecs. From the parquet APIs, I think you could use placeholder value like -1 to signify using the default compression level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, -1 cannot be a placeholder because it can be a valid compression level for some codec which can be different from the default in Arrow. That's why it's best to query the default compression level close to the API. We cannot have a magic value which signals to use the default value which is some other magic value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why it's best to query the default compression level close to the API.

That doesn't follow. You could e.g. have both ZSTDCodec() and ZSTDCodec(int compression_level) constructors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is ZSTDCodec(int compression_level = kDefaultCompressionLevel) where kDefaultCompressionLevel is chosen such that it is unlikely to be confused with an actual compression level (for example INT_MIN).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZSTDCodec(int compression_level = kDefaultCompressionLevel) this is not any different from your early proposal of having two constructors. On higher levels in the stack it would still have to be determined whether to pass an argument or not.

I do not want to propagate some random value through the whole stack with the hope that it doesn't conflict with what the user might specify.

So, let's see. 20 users decide that they want to pass INT_MIN as a compression level. Sure, it doesn't make sense but it works! ZSTD would be set up to use a compression level of 1 since INT_MIN is this magic value which is a hint that arrow is free to select whatever it wants.
Then, arrow changes the implementation a bit so that instead of INT_MIN we pick the much nicer 0x31337 magic value.
All user code suddenly starts behaving in a different way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On higher levels in the stack it would still have to be determined whether to pass an argument or not.

No, you could simply pass kDefaultCompressionLevel.

So, let's see. 20 users decide that they want to pass INT_MIN as a compression level. Sure, it doesn't make sense but it works!

I am not very interested in discussing contrived scenarios.

Copy link
Contributor Author

@martinradev martinradev Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you could simply pass kDefaultCompressionLevel.

Yes, but kDefaultCompressionLevel will be that magic value you suggest introducing.

I am not very interested in discussing contrived scenarios.

It is a hack to have an internal representation which can conflict with what the user provides. Especially, if it can easily be avoided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, -1 cannot be a placeholder because it can be a valid compression level for some codec which can be different from the default in Arrow.

Ok, thanks for correcting me. Another suggestion would be to add a method on ColumnProperties that can tell you whether to use the codec defaults or override it with a user-supplied value. Something like:

bool using_compression_level

that only returns true when a user has manually set the compression level. I've seen problems like this solved with boost::optional but feels like overkill for this case, but as you've seen I'm often wrong so maybe check that out too.

@martinradev
Copy link
Contributor Author

Thanks for the reviews.
I had to rebase and I simplified the change a little bit.
@pitrou Can you please add whomever you think is an appropriate reviewer for the changes to the parquet reader/writer implementation?

@martinradev martinradev force-pushed the master branch 2 times, most recently from 21cc6f6 to 898362c Compare August 17, 2019 18:25
@pitrou pitrou requested a review from wesm August 17, 2019 18:47
@codecov-io
Copy link

codecov-io commented Aug 17, 2019

Codecov Report

Merging #5071 into master will decrease coverage by 12.33%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #5071       +/-   ##
==========================================
- Coverage   87.63%   75.3%   -12.34%     
==========================================
  Files        1022      57      -965     
  Lines      146258    3620   -142638     
  Branches     1437       0     -1437     
==========================================
- Hits       128177    2726   -125451     
+ Misses      17719     894    -16825     
+ Partials      362       0      -362
Impacted Files Coverage Δ
python/pyarrow/ipc.pxi
cpp/src/parquet/column_page.h
cpp/src/plasma/test/external_store_tests.cc
cpp/src/arrow/array/builder_decimal.cc
cpp/src/plasma/client.cc
rust/parquet/src/arrow/converter.rs
cpp/src/arrow/compute/compute_test.cc
cpp/src/arrow/python/io.cc
python/pyarrow/hdfs.py
cpp/src/arrow/adapters/orc/adapter_test.cc
... and 937 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f413c8f...7f3bf21. Read the comment docs.

@martinradev
Copy link
Contributor Author

@emkornfield are you comfortable with the changes?

@martinradev martinradev force-pushed the master branch 2 times, most recently from 31cfbe7 to 77221eb Compare August 19, 2019 21:38
Copy link
Contributor

@hatemhelal hatemhelal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martinradev this is looking much improved from when I last reviewed. I do think this can be further simplified by making each codec responsible for managing its own default compression level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense for each codec to "know" its own default compression level rather than localizing/externalizing this information into a function that deals with all the codecs. From the parquet APIs, I think you could use placeholder value like -1 to signify using the default compression level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to switch to int32_t? It looks like int was previously used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I used int32_t from the start for everywhere I propagated the compression level. I also used int32_t since it looks like that's the preferred type in Arrow and other programming guidelines.
I know that the compression libraries used in Arrow have the compression level as 'int'. I don't think this is a problem for any system using Arrow but I could replace all int32_t with int if you expect int and int32_t to not have the same size.

@martinradev
Copy link
Contributor Author

@wesm can you please review this patch?

@emkornfield
Copy link
Contributor

emkornfield commented Aug 20, 2019

My drive by comment is I tend to agree with @pitrou and @hatemhelal but understand being careful about collisions. A few suggestions which may or may not be a good idea:

  • use boost::optional which is already used in gandiva (preferably we would have a vendored version)
  • simulate optional in this case by grouping compression type and compression into a class that tracks if compression level is set. I think this makes for a cleaner API anyway
  • I'm ok risking using a sentinel value that might tie our hands for future compression type e.g int_min

Also it would be nice to keep existing APIs in place if at all possible and possibly deprecate them instead of replacing them.

@pitrou
Copy link
Member

pitrou commented Aug 20, 2019

If we want to gold-plate this, we should probably make the additional argument a CompressionOptions struct rather than a compression level integer.

But I still think a compression level with a default value like int kDefaultCompressionLevel = INT_MIN is completely fine. I don't care if the user passes such an unlikely value explicitly.

And using boost::optional for this sounds ridiculous.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I think there's some things we can do to make things simpler. I left some stylistic comments also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making this a static Codec::GetDefaultCompressionLevel. Little tidier that way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't because then this would add a dependency on the implementation of the codec. The implementation would only be visible if the library is compiled in and this would introduce a lot of ifdefs (ifdef ARROW_WITH_ZLIB, etc).

However, I can change it as you requested. I'm probably not getting this patch in if I would be stubborn for everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to follow what @pitrou recommended by using numeric_limits::min as a hint to let the implementation select whatever compression level it deems appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making this a static method... I will push changes to your branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the constant in compression_brotli.h and then use that constant in the implementation of the get-default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undid my change with regards to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to return an error if you specify a compression level for a compressor that does not support it? Either way we need tests asserting what the behavior is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to mention in the api for parquet-cpp that parquet/arrow do not do validation on the passed compression level.

  1. it could be some effort to track what is the range of possible compression levels for a version of a compressor. Thus, it's best if the user finds out what's meaningful. Worst case initializing the codec would fail.
  2. With that I also decided that I would not add validation for whether a compressor supports specifying a compression level. It's a hint and it doesn't necessarily lead to better compression ratio or speed.

I hope you're fine with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undid my change with regards to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This almost suggests it would be better to pass properties_ and the column path into PageWriter::Open instead. Not sure how much refactoring that would be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's only the compression type and level which are passed, and in this particular case the memory pool. However, not all three are always passed but rather the compression level and codec. Thus, I do not want to pass properties since there is a lot more unnecessary things there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as above re: this code duplication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems silly to me to continue to maintain this code. Can we do using Compression = ::arrow::Compression in parquet/types.h and move on?

Copy link
Contributor Author

@martinradev martinradev Aug 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By this, you mean parquet::Compression to be removed and replaced with arrow::Compression?
The API would be broken then since this would also expose Compression::BZ2 which is not a valid compression algorithm for Parquet.
I do not know the history of the project and why this is so.

Furthermore, the enum values are different for arrow::Compression and parquet::Compression. I am not sure whether this could create back-compatibility issues all of a sudden.

To me it looks that some refactoring would be necessary and that it's not part of this patch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would prohibit unsupported compressions in Parquet.

@xhochy @majetideepak any thoughts about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this either way as long as we are to get programmatically which codecs are supported by Parquet.

@wesm
Copy link
Member

wesm commented Aug 20, 2019

@martinradev the PR description is no longer accurate, can you update it for the sake of the changelog?

@martinradev
Copy link
Contributor Author

@wesm @pitrou Thanks for the reviews. I made some more changes and I think the current version of the patch is closer to what @pitrou wanted from the start.
I also changed the type of the compression level from int32_t to int since this is what all supported compressors with the option for level selection expect.

Can you give another review?

ColumnProperties default_column_properties_;
std::unordered_map<std::string, Encoding::type> encodings_;
std::unordered_map<std::string, Compression::type> codecs_;
std::unordered_map<std::string, int32_t> codecs_compression_level_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this should be int. I can write a follow-up patch if there are no more things to fix in this one.
If there are more things to fix, then I will fix it alongside whatever needs to be addressed.

@martinradev
Copy link
Contributor Author

@wesm Can you have a second look?

@wesm
Copy link
Member

wesm commented Aug 25, 2019

Looking. In the future, can you use a feature branch? Using master makes things more difficult for project maintainers

* If no level is selected by the user or if the special
* std::numeric_limits<int>::min() value is passed, then Arrow selects the compression
* level.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use /// comments. I will fix

@wesm
Copy link
Member

wesm commented Aug 25, 2019

Bah, I got thwarted by the GitHub UI. It's difficult for me to edit your branch when it's "master" so I will open a new PR

@wesm
Copy link
Member

wesm commented Aug 25, 2019

See #5192

wesm added a commit that referenced this pull request Aug 26, 2019
…d to Parquet writer properties

This patch allows the user to select a compression level to be used for when column data in Parquet is compressed. The user can select between better compression ratio or speed.

Reopening PR to supersede #5071 since the GitHub UI closed the PR when I attempted to push commits to the contributor's master branch

Closes #5192 from wesm/ARROW-6216-compression and squashes the following commits:

8f12461 <Wes McKinney> Remove GzipCodec ctor that's ambiguous in MSVC
beafdda <Wes McKinney> clang-format
1c90586 <Wes McKinney> Move default compression levels to headers, remove nullary ctors
06d8a14 <Wes McKinney> Use arrow::Compression in parquet:: namespace, address other stylistic nits
7f3bf21 <Martin Radev> ARROW-6216:  Expose codec compression level to user

Lead-authored-by: Martin Radev <[email protected]>
Co-authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants