ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) does not create the expected vector type for dictionary-encoded fields #2681
Conversation
This is a WIP patch, but I first want to make sure the behavior is what we want. @BryanCutler @jacques-n do you think this makes sense?
Also interested to hear from @wesm how the similar API in Python/C++ behaves.
gentle ping @jacques-n
Because we don't have the same builder API that Java does, we don't have this precise issue in C++. We have a DictionaryBuilder class for incremental builds of dictionary-encoded arrays. Otherwise, the DictionaryArray is composed from the indices and the dictionary type.
cc @elahrvivaz. Wes mentioned you guys might be using dictionary encoding. I wonder if you have any thoughts on this?
@icexelloss this does seem unintuitive to me, but I believe that the way it works now is correct. The field type is the dictionary-encoded type, not the decoded type. To see the actual decoded type, you have to look up the corresponding dictionary and get its type. I don't think that this is explicitly defined in https://github.com/apache/arrow/blob/master/format/Schema.fbs, but I believe that I initiated some discussion on it at the time. This means that any dictionary encoding has to be handled by the user, and isn't handled at all by the arrow library itself.
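A minimal sketch of that lookup, using Arrow's Java API (this is not from the thread; the field name, dictionary id, and types are invented for illustration):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class DictionaryLookupSketch {
  public static void main(String[] args) {
    // The field's declared type is the encoded (index) type, here int32.
    DictionaryEncoding encoding =
        new DictionaryEncoding(/*id=*/1L, /*ordered=*/false, new ArrowType.Int(32, true));
    Field encodedField =
        new Field("city", new FieldType(true, new ArrowType.Int(32, true), encoding), null);

    // The decoded type (utf8) is only discoverable by looking up the dictionary by id.
    try (RootAllocator allocator = new RootAllocator();
         VarCharVector dictVector = new VarCharVector("city-dict", allocator)) {
      DictionaryProvider.MapDictionaryProvider provider =
          new DictionaryProvider.MapDictionaryProvider(new Dictionary(dictVector, encoding));
      ArrowType decodedType =
          provider.lookup(encoding.getId()).getVector().getField().getType(); // utf8
    }
  }
}
```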
@elahrvivaz Thanks for the input. I am curious how you are creating dictionary-encoded vectors from a schema now; can you show me your example usage?
Is it possible to find the discussion somewhere?
@icexelloss I think most of the discussion I was thinking of is in the initial Java dictionary PR here: #309. As for creating the vectors, I was looking through our code, and I believe that we never invoke it.
@elahrvivaz Thanks for the explanation. I read through the discussion, but I am not sure that I agree with you here. I think the behavior of (1) should be the same as (2). Since in (2) the user passes encoded vectors for dictionary-encoded fields, I think (1) should also create encoded vectors for dictionary-encoded fields. However, (1) now creates decoded vectors, and I think it should be fixed. Do you think otherwise?
I believe that the vectors/fields that we're passing do correspond to the same fields that would be created. Thanks,
@elahrvivaz Aha, I see. Thanks for the clarification. If that's the case, then this PR can be closed. Let me open a separate PR to address the documentation issue.
I assume that the C++ implementation encodes the fields the same way, otherwise the integration tests would fail. It does seem a bit confusing and maybe it would be worth revisiting.
Back to this. From @wesm's comment in #2798:
Since the schema in the IPC format maps directly to the Schema object in Java, I think the correct usage is to pass the decoded type to the field object, i.e.: instead of
@elahrvivaz @BryanCutler do you guys agree?
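A hedged sketch of the two field declarations being contrasted here (field name and dictionary id are invented for illustration; this is one reading of the proposal, not an implementation from the thread):

```java
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class FieldTypeSketch {
  public static void main(String[] args) {
    DictionaryEncoding enc =
        new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));

    // Proposed usage: the Field carries the decoded type (utf8);
    // the index type lives only on the DictionaryEncoding.
    Field decodedStyle =
        new Field("city", new FieldType(true, ArrowType.Utf8.INSTANCE, enc), null);

    // Current convention: the Field carries the encoded (index) type directly.
    Field indexStyle =
        new Field("city", new FieldType(true, new ArrowType.Int(32, true), enc), null);
  }
}
```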
This was what I was saying about things being hacky in Java. You mentioned
How hard would it be to create a
@wesm I think the issue is that the Schema and Field objects are mapped directly to the JSON representation of the schema in IPC, for example: Mapping between the Java object and the Flatbuffer object is not hard to do because it's hand written: So it's not as easy to introduce a synthetic DictionaryType similar to C++, because I think most of the Java classes are designed to map to the IPC types. I think it's possible to decouple the Java classes from the IPC types, but it could be a pretty large effort.
I thought we were talking about removing the JSON dependency from Arrow and making it a test dependency instead. That's the way things are in C++. @jacques-n?
I would probably prefer JSON to be a dependency in arrow tools, because I have been in places where people want to use the JSON representation of an Arrow table for debugging purposes (for example, in a REST service that returns an Arrow table). Besides the JSON issue, the Java code still has some codegen using https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/data/ArrowTypes.tdd
I think introducing a synthetic type here might add some complexity to other parts of the code. Maybe it's not too bad, but without implementing it, it's hard to know for sure.
I find this entire thread quite hard to follow. It feels like we're developing APIs without a clear set of sample usages in mind. Can we start with the example program that we want to use the APIs in, and discuss the key requirements? For example:
Per the JSON comments, yes, Jackson should be an optional separate module around JSON serialization. I would think it is entirely its own module and doesn't relate to other tools, etc.
I also had a difficult time following the thread. The title of the PR is not clear, for example. It sounds like the problem is "VectorSchemaRoot.getVector does not return the expected vector type for dictionary-encoded fields"
@wesm @jacques-n Sorry for the confusion. Let me clear up the problem statement a bit and update the PR description, and let's go from there. I appreciate your time providing feedback so far, and apologies for not being very organized here.
@icexelloss I think you've hit the main sticking point, in that the code-gen, ArrowType/FieldType, as well as the FieldReader/Writer code would all have to be modified to account for dictionary-encoded fields, and there wasn't an appetite to make that kind of wholesale change when the dictionary encoding was first introduced.
Per discussion on #2681 (comment)

Author: Li Jin <[email protected]>

Closes #2798 from icexelloss/ARROW-3566 and squashes the following commits:

4c6ebb1 <Li Jin> type should be decoded type
594c5a4 <Li Jin> Clarify the type of dictonary encoded field
@emkornfield if you have the time and interest, dictionary-encoding in general in Java could use some TLC
@wesm, I probably won't be able to get to this for at least a few weeks. @liyafan82 or @praveenbingo is this of interest to either of you?
I am interested in it, and I would like to take a look. However, I may need some time to get familiar with the related code and the background. |
@emkornfield I can dedicate some review/help bandwidth, but I'm a bit stuck for the next few weeks with an internal release.
Thanks @liyafan82. If I understand the rough scope of the change, it might be fairly large; if that is the case, please discuss any proposed solution on the mailing list before getting too far into coding, to make sure we can get consensus on the approach.
Sure. Sounds reasonable. |
This looks like a style problem.
Agreed, I wonder if this PR is old enough that we didn't have checkstyle enforcement?
@emkornfield , I have prepared a doc for the dictionary related use cases.
This PR solves problem 2 in the doc (misleading constructor), and it is a relatively small problem and involves small changes.
Can we first merge this PR and close this issue? It looks fine to me, except for some style problems and some comments that need to be removed.
Closing this for now; let's continue the discussion about dictionary-encoded data in Java on the mailing list.
When creating a dictionary-encoded vector from a Schema, I expect that a vector of the encoded type is created; however, currently a decoded vector will be created, i.e.:
I expect to get an encoded vector because that's the underlying memory representation. I also get an encoded vector when consuming a dictionary-encoded vector, so I think the producer side should be the same.
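A sketch of the reported behavior, assuming Arrow's Java API as it stood during this thread (the field name, dictionary id, and types are invented for illustration):

```java
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class CreateSketch {
  public static void main(String[] args) {
    DictionaryEncoding enc =
        new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));
    Schema schema = new Schema(Collections.singletonList(
        new Field("city", new FieldType(true, ArrowType.Utf8.INSTANCE, enc), null)));

    try (RootAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      FieldVector v = root.getVector("city");
      // Reported behavior: v is a decoded vector (VarCharVector).
      // Expectation in this issue: a vector of dictionary indices (IntVector),
      // since that is the underlying memory representation.
    }
  }
}
```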