ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) does not create the expected vector type for dictionary-encoded fields #2681
Conversation
This is a WIP patch, but I first want to make sure the behavior is what we want. @BryanCutler @jacques-n do you think this makes sense?
Also interested to hear from @wesm how the similar API in Python/C++ behaves.
gentle ping @jacques-n
Because we don't have the same builder API that Java does, we don't have this precise issue in C++. We have a DictionaryBuilder class for incremental builds of dictionary-encoded arrays. Otherwise, the DictionaryArray is composed from the indices and the dictionary type.
cc @elahrvivaz. Wes mentioned you guys might be using dictionary encoding. I wonder if you have any thoughts on this?
@icexelloss this does seem unintuitive to me, but I believe that the way it works now is correct. The field type is the dictionary-encoded type, not the decoded type. To see the actual decoded type, you have to look up the corresponding dictionary and get its type. I don't think that this is explicitly defined in https://github.com/apache/arrow/blob/master/format/Schema.fbs, but I believe that I initiated some discussion on it at the time. This means that any dictionary encoding has to be handled by the user, and isn't handled at all by the arrow library itself.
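A minimal sketch of that lookup, using Arrow's Java API (this is not from the thread; the field name, dictionary id, and types are invented for illustration):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class DictionaryLookupSketch {
  public static void main(String[] args) {
    // The field's declared type is the encoded (index) type, here int32.
    DictionaryEncoding encoding =
        new DictionaryEncoding(/*id=*/1L, /*ordered=*/false, new ArrowType.Int(32, true));
    Field encodedField =
        new Field("city", new FieldType(true, new ArrowType.Int(32, true), encoding), null);

    // The decoded type (utf8) is only discoverable by looking up the dictionary by id.
    try (RootAllocator allocator = new RootAllocator();
         VarCharVector dictVector = new VarCharVector("city-dict", allocator)) {
      DictionaryProvider.MapDictionaryProvider provider =
          new DictionaryProvider.MapDictionaryProvider(new Dictionary(dictVector, encoding));
      ArrowType decodedType =
          provider.lookup(encoding.getId()).getVector().getField().getType(); // utf8
    }
  }
}
```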
@elahrvivaz Thanks for the input. I am curious how you are creating dictionary-encoded vectors from a schema now; can you show me your example usage?
Is it possible to find the discussion somewhere?
@icexelloss I think most of the discussion I was thinking of is in the initial Java dictionary PR here: #309. As for creating the vectors, I was looking through our code, and I believe that we never invoke it.
@elahrvivaz Thanks for the explanation. I read through the discussion, but I am not sure that I agree with you here. I think the behavior of (1) should be the same as (2). Since in (2) the user passes encoded vectors for dictionary-encoded fields, I think (1) should also create encoded vectors for dictionary-encoded fields. However, (1) now creates decoded vectors, and I think it should be fixed. Do you think otherwise?
I believe that the vectors/fields that we're passing do correspond to the same fields that would be created. Thanks,
@elahrvivaz Aha, I see. Thanks for the clarification. If that's the case, then this PR can be closed. Let me open a separate PR to address the documentation issue.
I assume that the C++ implementation encodes the fields the same way, otherwise the integration tests would fail. It does seem a bit confusing and maybe it would be worth revisiting.
Back to this. From @wesm's comment in #2798:
Since the schema in the IPC format maps directly to the Schema object in Java, I think the correct usage is to pass the decoded type to the field object, i.e.: instead of
@elahrvivaz @BryanCutler do you guys agree?
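A hedged sketch of the two field declarations being contrasted here (field name and dictionary id are invented for illustration; this is one reading of the proposal, not an implementation from the thread):

```java
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class FieldTypeSketch {
  public static void main(String[] args) {
    DictionaryEncoding enc =
        new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));

    // Proposed usage: the Field carries the decoded type (utf8);
    // the index type lives only on the DictionaryEncoding.
    Field decodedStyle =
        new Field("city", new FieldType(true, ArrowType.Utf8.INSTANCE, enc), null);

    // Current convention: the Field carries the encoded (index) type directly.
    Field indexStyle =
        new Field("city", new FieldType(true, new ArrowType.Int(32, true), enc), null);
  }
}
```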
This was what I was saying about things being hacky in Java. You mentioned
How hard would it be to create a
@wesm I think the issue is that the Schema and Field objects are mapped directly to the JSON representation of the schema in IPC, for example: Mapping between the Java object and the Flatbuffer object is not hard to do because it's hand written: So it's not as easy to introduce a synthetic DictionaryType similar to C++, because I think most of the Java classes are designed to map to the IPC types. I think it's possible to decouple the Java classes from the IPC types, but it could be a pretty large effort.
I thought we were talking about removing the JSON dependency from Arrow and making it a test dependency instead. That's the way things are in C++. @jacques-n?
I would probably prefer JSON to be a dependency in arrow tools, because I have been in places where people want to use the JSON representation of an Arrow table for debugging purposes (for example, in a REST service that returns an Arrow table). Besides the JSON issue, the Java code still has some codegen using https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/data/ArrowTypes.tdd
I think introducing a synthetic type here might add some complexity to other parts of the code. Maybe it's not too bad, but without implementing it, it's hard to know for sure.
I find this entire thread quite hard to follow. It feels like we're developing APIs without a clear set of sample usages in mind. Can we start with the example program that we want to use the APIs in, and discuss the key requirements? For example:
Per the JSON comments, yes, Jackson should be an optional separate module around JSON serialization. I would think it is entirely its own module and doesn't relate to other tools, etc.
I also had a difficult time following the thread. The title of the PR is not clear, for example. It sounds like the problem is "VectorSchemaRoot.getVector does not return the expected vector type for dictionary-encoded fields"
@wesm @jacques-n Sorry for the confusion. Let me clear up the problem statement a bit and update the PR description, and let's go from there. I appreciate your time providing feedback so far, and apologies for not being very organized here.
@icexelloss I think you've hit the main sticking point, in that the code-gen, ArrowType/FieldType, as well as the FieldReader/Writer code would all have to be modified to account for dictionary-encoded fields, and there wasn't an appetite to make that kind of wholesale change when the dictionary encoding was first introduced.
Per discussion on #2681 (comment)

Author: Li Jin <[email protected]>

Closes #2798 from icexelloss/ARROW-3566 and squashes the following commits:

4c6ebb1 <Li Jin> type should be decoded type
594c5a4 <Li Jin> Clarify the type of dictonary encoded field
@emkornfield if you have the time and interest, dictionary-encoding in general in Java could use some TLC
@wesm, I probably won't be able to get to this for at least a few weeks. @liyafan82 or @praveenbingo is this of interest to either of you?
I am interested in it, and I would like to take a look. However, I may need some time to get familiar with the related code and the background. |
@emkornfield I can dedicate some review/help bandwidth, but I'm a bit stuck for the next few weeks with an internal release.
Thanks @liyafan82. If I understand the rough scope of the change, it might be fairly large; if that is the case, please discuss any proposed solution on the mailing list before getting too far into coding, to make sure we can get consensus on the approach.
Sure. Sounds reasonable. |
This looks like a style problem.
Agreed, I wonder if this PR is old enough that we didn't have checkstyle enforcement?
@emkornfield , I have prepared a doc for the dictionary related use cases.
This PR solves problem 2 in the doc (misleading constructor), and it is a relatively small problem and involves small changes.
Can we first merge this PR and close this issue? It looks fine to me, except for some style problems and some comments that need to be removed.
Closing this for now; let's continue the discussion about dictionary-encoded data in Java on the mailing list.
When creating a dictionary-encoded vector from a Schema, I expect that a vector of the encoded type is created; however, currently a decoded vector will be created, i.e.:
I expect to get an encoded vector because that's the underlying memory representation. I also get an encoded vector when consuming a dictionary-encoded vector, so I think the producer side should be the same.
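A sketch of the reported behavior, assuming Arrow's Java API as it stood during this thread (the field name, dictionary id, and types are invented for illustration):

```java
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class CreateSketch {
  public static void main(String[] args) {
    DictionaryEncoding enc =
        new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));
    Schema schema = new Schema(Collections.singletonList(
        new Field("city", new FieldType(true, ArrowType.Utf8.INSTANCE, enc), null)));

    try (RootAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      FieldVector v = root.getVector("city");
      // Reported behavior: v is a decoded vector (VarCharVector).
      // Expectation in this issue: a vector of dictionary indices (IntVector),
      // since that is the underlying memory representation.
    }
  }
}
```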