-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6933: [Java] Suppor linear dictionary encoder #5692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@emkornfield This is almost the same as #5058. Would you please take a look? |
|
|
||
| /** | ||
| * The dictionary for encoding/decoding. | ||
| * It must be sorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't think it needs to be sorted? or at least clarify sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the mistake. This sentence is removed.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of minor comments on documentation, can be merged afterwards.
|
|
||
| /** | ||
| * Constructs a dictionary encoder. | ||
| * @param dictionary the dictionary. Its entries should be sorted in the non-increasing order of their frequency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorting is performance not correctness correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is made explicity in the revised code.
| private Range range; | ||
|
|
||
| /** | ||
| * Constructs a dictionary encoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that dictionary encoding is false by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is made explicit in the revised code.
Codecov Report
@@ Coverage Diff @@
## master #5692 +/- ##
==========================================
+ Coverage 88.93% 89.98% +1.04%
==========================================
Files 989 739 -250
Lines 134508 114814 -19694
Branches 1501 0 -1501
==========================================
- Hits 119627 103311 -16316
+ Misses 14516 11503 -3013
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
For many scenarios, the distribution of dictionary entries is highly skewed. In other words, a few dictionary entries occurs much more frequently than others. If we can sort the dictionary by the non-increasing order of entry frequencies, and compare each value to encode from the beginning of the dictionary, we get the following benefits: 1) We need no extra memory space or data structure. 2) The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary. This is the basic idea behind the linear dictionary encoder. When the scenario is right (highly skewed dictionary distribution), it outperforms both search based encoder and hash table based encoders. Closes apache#5692 from liyafan82/fly_1018_line and squashes the following commits: 78d7196 <liyafan82> Improve Javadocs 2268438 <liyafan82> Suppor linear dictionary encoder Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
For many scenarios, the distribution of dictionary entries is highly skewed. In other words, a few dictionary entries occurs much more frequently than others. If we can sort the dictionary by the non-increasing order of entry frequencies, and compare each value to encode from the beginning of the dictionary, we get the following benefits:
This is the basic idea behind the linear dictionary encoder. When the scenario is right (highly skewed dictionary distribution), it outperforms both search based encoder and hash table based encoders.