Skip to content

Conversation

@liyafan82
Copy link
Contributor

This is the second part of ARROW-5917. We provide a sort based encoder, as well as a hash table based encoder, to solve the problem with the current dictionary encoder.

In particular, we solve the following problems with the current encoder:

1. There are repeated conversions between Java objects and bytes (e.g. vector.getObject).
2. Unnecessary memory copy (the vector data must be copied to the hash table).
3. The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).
4. The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)
5. The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

@codecov-io
Copy link

codecov-io commented Aug 11, 2019

Codecov Report

Merging #5058 into master will increase coverage by 1.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5058      +/-   ##
==========================================
+ Coverage   88.79%    89.8%   +1.01%     
==========================================
  Files         983      732     -251     
  Lines      132170   112106   -20064     
  Branches     1501        0    -1501     
==========================================
- Hits       117362   100681   -16681     
+ Misses      14443    11425    -3018     
+ Partials      365        0     -365
Impacted Files Coverage Δ
python/pyarrow/plasma.py 58.9% <0%> (-1.37%) ⬇️
python/pyarrow/tests/test_parquet.py 95.24% <0%> (-0.06%) ⬇️
r/src/recordbatch.cpp
go/arrow/math/uint64_amd64.go
r/R/filesystem.R
r/R/list.R
r/src/symbols.cpp
r/src/array_to_vector.cpp
go/arrow/ipc/file_reader.go
js/src/builder/index.ts
... and 245 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af097e6...8872b6e. Read the comment docs.

@liyafan82
Copy link
Contributor Author

liyafan82 commented Sep 20, 2019

@emkornfield I have revised the code according to the review comments in #4994. Would you please take a look?

@wesm
Copy link
Member

wesm commented Oct 3, 2019

@emkornfield ping?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decodingis not longer done in this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thank you.
The descriptions about decoding is removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a dictionary implementation in Vector that doesn't require boxing/unboxing. Is it possible to use that (or does that limit the ability to use a custom hasher)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The dictionary implementation no longer requries boxing/unboxing. However, it does not support a custom hasher.
I think solving this problem requires some non-trivial changes to our code base. So can we do it in the future in a separate issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are generic necessary here, what value do they provide?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general, a generic type makes the parameter type fixed in compilation time, which saves some effort at run time to look up the virtual table for the correct implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a reference? My understanding was these types are erased. The only times that types can be referenced at runtime is a subclass is create which the pattern that Guice's TypeLiteral uses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield
I am sorry because my previous comment makes no sense. I should have thought carefully before uttering a comment.

In general, generic has almost no impact on performance due to type erasure, as you have indicated. The only impact (if any) is related to type cast.

The benefit of generic is that it gives the Java compiler a chance to discover potential (type related) errors in the code, and prevent them from happening at runtime proactively.

This is a general problem. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type of errors would the compiler detect using generics that it wouldn't detected if you replaces all references to E with BaseIntVector and D with ElementAddressableVector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's give a concrete example from this PR (I am not sure if it is appropriate).

For this PR, the encoder has a constructor like this,

public HashTableDictionaryEncoder(D dictionary)

and a encode method like this,

public void encode(D input, E output)

The generic makes sure that the input parameter to the constructor and the first parameter to the encode method must have the same type.

If we do not use generic, the constructor should be:

public HashTableDictionaryEncoder(ElementAddressableVector dictionary)

and the encode method should be

public void encode(ElementAddressableVector input, ValueVector output)

With the above code, it is possible to construct the encoder with a IntVector, and call the encode method with a VarCharVector. The compiler cannot detect this error, but an exception will be thrown at runtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That is a good example for "D" is there similar one for "E"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think there is a similar example for E (I am not sure if it is appropriate).

Suppose we have an specific IPC specification to encode VarCharVector with an IntVector.
With generic, we can declare a encoder like this HashTableEncoder<VarCharVector, IntVector>.

If we break the specification above to encode a VarCharVector by a BigIntVector by calling:

BigIntVector out = ...
encoder.encode(in, out);

With generic, this problem will be detected by the java compiler.

However, without generic, this problem will only be detected at runtime, because BigIntVector is also a sub-type oc BaseIntVector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield
BTW, do you think we should support ListVector/MapVector, etc. with generic types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not decoding happens here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The descriptions about decoding are removed. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you name this more descriptively to indicate that the test is for when encodeNull is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good suggestion.
I have changed the name to testEncodeNullWithoutNullInDictionary, as it tests the case where the the encoder tries to encode a null in the input, but the dictionary does not has a null.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions/comments. thanks.

@kszucs kszucs force-pushed the fly_0811_hashencode branch from b4b3965 to 8872b6e Compare October 5, 2019 10:08
@liyafan82 liyafan82 force-pushed the fly_0811_hashencode branch from a97fe2b to 7685b77 Compare October 8, 2019 03:04
@siddharthteotia
Copy link
Contributor

siddharthteotia commented Oct 12, 2019

How do we intend to use the hash table based dictionary encoder? AFAIK, the dictionary for the dictionary encoded column should be sorted. If value V is encoded as E then V' where V' > V should be encoded as E' where E' > E. This will make predicate evaluation on the dictionary encoded column effiicient

The underlying data structure for the dictionary could very well be the hash table but in that case we need to generate/build dictionary in some other data structure which is sorted and then put them into the hash table for subsequent usage. I don't see a sort phase for hash table dictionary encoder.

@liyafan82
Copy link
Contributor Author

How do we intend to use the hash table based dictionary encoder? AFAIK, the dictionary for the dictionary encoded column should be sorted. If value V is encoded as E then V' where V' > V should be encoded as E' where E' > E. This will make predicate evaluation on the dictionary encoded column.

The underlying data structure for the dictionary could very well be the hash table but in that case we need to generate/build dictionary in some other data structure which is sorted and then put them into the hash table for subsequent usage. I don't see a sort phase for hash table dictionary encoder.

@siddharthteotia Thanks for your valuable feedback.
I can understand the scenario you described, and we already have SearchDictionaryEncoder to support it.

The encoder in this PR supports another scenario in which the dictonary is not sorted. There are two benefits for this encoder:

  1. the dictionary entries can be added in the same order they are encountered. The benefit of this is that we can build the dictionary and encode incoming data simultaneously. For details, please see Section 3.1.5 (online dictionary builder) of our design doc.
  2. In theory, the hash table based implementation encodes each value in O(1) time. However, for a sorted dictionary, we need O(log(n)) time to search for it.

@emkornfield
Copy link
Contributor

AFAIK, the dictionary for the dictionary encoded column should be sorted. If value V is encoded as E then V' where V' > V should be encoded as E' where E' > E. This will make predicate evaluation on the dictionary encoded column efficient

@siddharthteotia this isn't required for the IPC spec which makes ordering of dictionary columns a parameter

Dictionary encodings still provide benefits in other contexts (e.g. any stateless transform on a column can occur just by applying the transform to the dictionary instead of every entry in the column)

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
This is the second part of ARROW-5917. We provide a sort based encoder, as well as a hash table based encoder, to solve the problem with the current dictionary encoder.

In particular, we solve the following problems with the current encoder:

    1. There are repeated conversions between Java objects and bytes (e.g. vector.getObject).
    2. Unnecessary memory copy (the vector data must be copied to the hash table).
    3. The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).
    4. The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)
    5. The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

Closes apache#5058 from liyafan82/fly_0811_hashencode and squashes the following commits:

7685b77 <liyafan82>  Resolve comments
8872b6e <liyafan82>  Provide hash table based dictionary encoder

Authored-by: liyafan82 <[email protected]>
Signed-off-by: Micah Kornfield <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants