-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6184: [Java] Provide hash table based dictionary encoder #5058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5058 +/- ##
==========================================
+ Coverage 88.79% 89.8% +1.01%
==========================================
Files 983 732 -251
Lines 132170 112106 -20064
Branches 1501 0 -1501
==========================================
- Hits 117362 100681 -16681
+ Misses 14443 11425 -3018
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
7d558c7 to
6bd419a
Compare
6bd419a to
3a3ec44
Compare
3a3ec44 to
b4b3965
Compare
|
@emkornfield I have revised the code according to the review comments in #4994. Would you please take a look? |
|
@emkornfield ping? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decodingis not longer done in this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Thank you.
The descriptions about decoding is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a dictionary implementation in Vector that doesn't require boxing/unboxing. Is it possible to use that (or does that limit the ability to use a custom hasher)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. The dictionary implementation no longer requries boxing/unboxing. However, it does not support a custom hasher.
I think solving this problem requires some non-trivial changes to our code base. So can we do it in the future in a separate issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are generic necessary here, what value do they provide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in general, a generic type makes the parameter type fixed in compilation time, which saves some effort at run time to look up the virtual table for the correct implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a reference? My understanding was these types are erased. The only times that types can be referenced at runtime is a subclass is create which the pattern that Guice's TypeLiteral uses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield
I am sorry because my previous comment makes no sense. I should have thought carefully before uttering a comment.
In general, generic has almost no impact on performance due to type erasure, as you have indicated. The only impact (if any) is related to type cast.
The benefit of generic is that it gives the Java compiler a chance to discover potential (type related) errors in the code, and prevent them from happening at runtime proactively.
This is a general problem. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What type of errors would the compiler detect using generics that it wouldn't detected if you replaces all references to E with BaseIntVector and D with ElementAddressableVector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's give a concrete example from this PR (I am not sure if it is appropriate).
For this PR, the encoder has a constructor like this,
public HashTableDictionaryEncoder(D dictionary)
and a encode method like this,
public void encode(D input, E output)
The generic makes sure that the input parameter to the constructor and the first parameter to the encode method must have the same type.
If we do not use generic, the constructor should be:
public HashTableDictionaryEncoder(ElementAddressableVector dictionary)
and the encode method should be
public void encode(ElementAddressableVector input, ValueVector output)
With the above code, it is possible to construct the encoder with a IntVector, and call the encode method with a VarCharVector. The compiler cannot detect this error, but an exception will be thrown at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That is a good example for "D" is there similar one for "E"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think there is a similar example for E (I am not sure if it is appropriate).
Suppose we have an specific IPC specification to encode VarCharVector with an IntVector.
With generic, we can declare a encoder like this HashTableEncoder<VarCharVector, IntVector>.
If we break the specification above to encode a VarCharVector by a BigIntVector by calling:
BigIntVector out = ...
encoder.encode(in, out);
With generic, this problem will be detected by the java compiler.
However, without generic, this problem will only be detected at runtime, because BigIntVector is also a sub-type oc BaseIntVector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield
BTW, do you think we should support ListVector/MapVector, etc. with generic types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not decoding happens here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The descriptions about decoding are removed. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you name this more descriptively to indicate that the test is for when encodeNull is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good suggestion.
I have changed the name to testEncodeNullWithoutNullInDictionary, as it tests the case where the the encoder tries to encode a null in the input, but the dictionary does not has a null.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions/comments. thanks.
b4b3965 to
8872b6e
Compare
a97fe2b to
7685b77
Compare
|
How do we intend to use the hash table based dictionary encoder? AFAIK, the dictionary for the dictionary encoded column should be sorted. If value V is encoded as E then V' where V' > V should be encoded as E' where E' > E. This will make predicate evaluation on the dictionary encoded column effiicient The underlying data structure for the dictionary could very well be the hash table but in that case we need to generate/build dictionary in some other data structure which is sorted and then put them into the hash table for subsequent usage. I don't see a sort phase for hash table dictionary encoder. |
@siddharthteotia Thanks for your valuable feedback. The encoder in this PR supports another scenario in which the dictonary is not sorted. There are two benefits for this encoder:
|
@siddharthteotia this isn't required for the IPC spec which makes ordering of dictionary columns a parameter Dictionary encodings still provide benefits in other contexts (e.g. any stateless transform on a column can occur just by applying the transform to the dictionary instead of every entry in the column) |
This is the second part of ARROW-5917. We provide a sort based encoder, as well as a hash table based encoder, to solve the problem with the current dictionary encoder.
In particular, we solve the following problems with the current encoder:
1. There are repeated conversions between Java objects and bytes (e.g. vector.getObject).
2. Unnecessary memory copy (the vector data must be copied to the hash table).
3. The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).
4. The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)
5. The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.
Closes apache#5058 from liyafan82/fly_0811_hashencode and squashes the following commits:
7685b77 <liyafan82> Resolve comments
8872b6e <liyafan82> Provide hash table based dictionary encoder
Authored-by: liyafan82 <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
This is the second part of ARROW-5917. We provide a sort based encoder, as well as a hash table based encoder, to solve the problem with the current dictionary encoder.
In particular, we solve the following problems with the current encoder: