Skip to content

Conversation

@icexelloss
Copy link
Contributor

@icexelloss icexelloss commented Oct 1, 2018

When creating a dictionary encoded vector from Schema, I expect that a vector of encoded type is created, however, currently will create a decoded vector, i.e.:

    // Create the dictionary
    VarCharVector dictVector = (VarCharVector)
        FieldType.nullable(new ArrowType.Utf8())
            .createNewSingleVector("enum1_dict", allocator, null);
    dictVector.allocateNew();
    dictVector.set(0, "foo".getBytes(StandardCharsets.UTF_8));
    dictVector.set(1, "bar".getBytes(StandardCharsets.UTF_8));
    dictVector.setValueCount(2);

    DictionaryEncoding enum1Encoding = new DictionaryEncoding(
        0,
        true,
        new ArrowType.Int(8, true)
    );

    DictionaryProvider dictionaryProvider = new DictionaryProvider.MapDictionaryProvider(
        new Dictionary(dictVector, enum1Encoding)
    );

    // Create the dictionary encoded vector
    Schema schema = new Schema(
        asList(
            new Field(
                "enum1",
                true,
                // this is the type of the decoded value
                new ArrowType.Utf8(),
                enum1Encoding,
                Collections.<Field>emptyList())
    ));

    // This doesn't not work without the patch
    // root.getVector("enum1") returns a VarCharVector instead of a TinyIntVector
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    TinyIntVector vector = (TinyIntVector) root.getVector("enum1");
    vector.allocateNew();

I expect to get an encoded vector because that's the underlying memory representation. I also got an encoded vector when consuming a dictionary encoded vector so I think on the producer side it should be the same.

@icexelloss icexelloss changed the title ARROW-3396: [Java][WIP] VectorSchemaRoot.create(schema, allocator) should encoded vector for dictionary encoding ARROW-3396: [WIP] [Java]VectorSchemaRoot.create(schema, allocator) should encoded vector for dictionary encoding Oct 1, 2018
@icexelloss
Copy link
Contributor Author

This is a WIP patch but I first want to make sure the behavior is what we want. @BryanCutler @jacques-n do you think this makes sense?

@icexelloss
Copy link
Contributor Author

Also interested to hear from @wesm how does similar API in Python/C++ behaves.

@icexelloss
Copy link
Contributor Author

gentle ping @jacques-n

@wesm wesm changed the title ARROW-3396: [WIP] [Java]VectorSchemaRoot.create(schema, allocator) should encoded vector for dictionary encoding ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should encoded vector for dictionary encoding Oct 3, 2018
@wesm
Copy link
Member

wesm commented Oct 7, 2018

Because we don't have the same builder API that Java does, we don't have this precise issue in C++. We have a DictionaryBuilder class for incremental builds of dictionary-encoded arrays. Otherwise the DictionaryArray is composed from the indices and the dictionary type

@icexelloss
Copy link
Contributor Author

cc @elahrvivaz . Wes mentioned you guys might be using dictionary encoding. I wonder if you have some opinion on this?

@elahrvivaz
Copy link
Contributor

@icexelloss this does seem unintuitive to me, but I believe that the way it works now is correct. The field type is the dictionary encoded type, not the decoded type. To see the actual decoded type, you have to look up the corresponding dictionary and get its type. I don't think that this is explicitly defined in https://github.com/apache/arrow/blob/master/format/Schema.fbs, but I believe that I initiated some discussion on it at the time. This means that any dictionary encoding has to be handled by the user, and isn't handled at all by the arrow library itself.

@icexelloss
Copy link
Contributor Author

icexelloss commented Oct 18, 2018

@elahrvivaz Thanks for the input. I am curious how are you creating dictionary encoded vector from schema now, can you show me your example usage?

but I believe that I initiated some discussion on it at the time

Is it possible to find the discussion somewhere?

@elahrvivaz
Copy link
Contributor

elahrvivaz commented Oct 18, 2018

@icexelloss I think most of the discussion I was thinking of is in the initial Java dictionary PR here: #309
There was some additional discussion in a follow-on PR for writers here, but seems less relevant from a quick skim: #334

As for creating the vectors, I was looking through our code, and I believe that we never invoke VectorSchemaRoot.create, but instead use the constructor with the vectors already defined, i.e. new VectorSchemaRoot(schema, Collections.singletonList(vector), vector.getValueCount) (see https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/SimpleFeatureArrowIO.scala#L313 for example).

@icexelloss
Copy link
Contributor Author

@elahrvivaz Thanks for the explanation. I read through the discussion but I am not sure that I agree with you here..

I think the behavior of

VectorSchemaRoot.create(schema, allocator)

Should be the same as

new VectorSchemaRoot(schema, vectors, vector.getValueCount)

Since in the (2), the user passes encoded vectors for dictionary encoded field, I think (1) should also create encoded vectors for dictionary encoded field. However, (1) now creates decoded vectors and I should be fixed. Do you think otherwise?

@elahrvivaz
Copy link
Contributor

I believe that the vectors/fields that we're passing do correspond to the same fields that would be created by VectorSchemaRoot.create. My point was that I believe that the schema you're creating should actually be:

 Schema schema = new Schema(
        asList(
            new Field(
                "enum1",
                true,
                // this is the type of the encoded value
                new ArrowType.Int32(),
                enum1Encoding,
                Collections.<Field>emptyList())
    ));

Thanks,

@icexelloss
Copy link
Contributor Author

icexelloss commented Oct 19, 2018

@elahrvivaz Aha I see. Thanks for the clarification. If that's the case then this PR can be closed. Let me open a separate PR to address the documentation issue.

@elahrvivaz
Copy link
Contributor

I assume that the C++ implementation encodes the fields the same way, otherwise the integration tests would fail. It does seem a bit confusing and maybe it would be worth revisiting.

@icexelloss
Copy link
Contributor Author

Back to this. From @wesm 's comment in #2798:

In C++ we have a synthetic DictionaryType that holds the dictionary / dictionary type as well as the index type. In the IPC metadata, the field type is that of the decoded type. The DictionaryEncoding table holds the index type

Since in Java, the schema in the IPC directly maps to the Schema object in Java, I think the correct usage is to pass the decoded type to the field object, i.e.:

    Schema schema = new Schema(
        asList(
            new Field(
                "enum1",
                true,
                // this is the type of the decoded value
                new ArrowType.Utf8(),
                enum1Encoding,
                Collections.<Field>emptyList())
    ));

instead of

    Schema schema = new Schema(
        asList(
            new Field(
                "enum1",
                true,
                // this is the type of the encoded value
                new ArrowType. Int32(),
                enum1Encoding,
                Collections.<Field>emptyList())
    ));

@elahrvivaz @BryanCutler do you guys agree?

@wesm
Copy link
Member

wesm commented Oct 26, 2018

This was what I was saying about things being hacky in Java. You mentioned

    // This doesn't not work without the patch
    // root.getVector("enum1") returns a VarCharVector instead of a TinyIntVector
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    TinyIntVector vector = (TinyIntVector) root.getVector("enum1");
    vector.allocateNew();

How hard would it be to create a DictionaryVector in Java? That seems like it could solve the problem.

@icexelloss
Copy link
Contributor Author

@wesm I think the issue is that the Schema and Field object are mapped directly to json presentation of the schema in IPC, for example:

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L63

Mapping between the Java object and Flatbuffer object is not hard to do because it's hand written:

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L101

So it's not as easy to introduce a synthetic DictionaryType similar to C++ because I think most of the Java classes are designed to map to the IPC types. I think it's possible to decouple the Java classes from the IPC types but it could be pretty large effort.

@wesm
Copy link
Member

wesm commented Oct 26, 2018

I thought we were talking about removing the JSON dependency from Arrow and making it a test dependency instead. That's the way things are in C++. @jacques-n?

@icexelloss
Copy link
Contributor Author

icexelloss commented Oct 26, 2018

I would probably prefer JSON to be a dependency in arrow tools because I have been places that people want to use the JSON presentation of Arrow table for debugging purpose (for example, in a rest service that returns Arrow table).

Besides JSON issue, the Java code still has some codegen using https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/data/ArrowTypes.tdd

I think introducing a synthetic type here might add some complexity to other parts of the code. Maybe it's not too bad but without implementing it it's hard to know for sure.

@jacques-n
Copy link
Contributor

I find this entire thread quite hard to follow. It feels like we're developing APIs without a clear set of sample usages in mind. Can we start with the example program that we want to use the APIs and discuss the key requirements? For example:

  • Some consumers want to interact with encoded values.
  • Some consumers want to interact with decoded values.
  • There should be tools to decorate or move from encoding to decoded values and vice versa.
  • The patterns for each should preferably look like ... and ...
  • etc...

Per JSON comments, yes, Jackson should be an optional separate module around json serialization. I would think it is entirely its own module and doesn't relate to other tools, etc.

@wesm
Copy link
Member

wesm commented Oct 26, 2018

I also had a difficult time following the thread. The title of the PR is not clear, for example. It sounds like the problem is "VectorSchemaRoot.getVector does not return the expected vector type for dictionary-encoded fields"

@icexelloss icexelloss changed the title ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should encoded vector for dictionary encoding ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should return encoded vector for dictionary encoding Oct 26, 2018
@icexelloss icexelloss changed the title ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should return encoded vector for dictionary encoding ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should return encoded vector for dictionary encoded field Oct 26, 2018
@icexelloss icexelloss changed the title ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) should return encoded vector for dictionary encoded field ARROW-3396: [WIP] [Java] VectorSchemaRoot.getVector does not return the expected vector type for dictionary-encoded fields Oct 26, 2018
@icexelloss icexelloss changed the title ARROW-3396: [WIP] [Java] VectorSchemaRoot.getVector does not return the expected vector type for dictionary-encoded fields ARROW-3396: [WIP] [Java] VectorSchemaRoot.create(schema, allocator) does not create the expected vector type for dictionary-encoded fields Oct 26, 2018
@icexelloss
Copy link
Contributor Author

@wesm @jacques-n Sorry for the confusion. Let me clear up the problem statement a bit and update the PR description. Let's go from there. I appreciate your time providing feedbacks so far and apologies for not being very organized here.

@elahrvivaz
Copy link
Contributor

@icexelloss I think you've hit the main sticking point, in that the code-gen, ArrowType/FieldType, as well as the FieldReader/Writer code would all have to be modified to account for dictionary-encoded fields, and there wasn't an appetite to make that kind of wholesale change when the dictionary encoding was first introduced.

wesm pushed a commit that referenced this pull request Oct 30, 2018
Per discussion on #2681 (comment)

Author: Li Jin <[email protected]>

Closes #2798 from icexelloss/ARROW-3566 and squashes the following commits:

4c6ebb1 <Li Jin> type should be decoded type
594c5a4 <Li Jin> Clarify the type of dictonary encoded field
@wesm
Copy link
Member

wesm commented Jun 3, 2019

@emkornfield if you have the time and interest, dictionary-encoding in general in Java could use some TLC

@emkornfield
Copy link
Contributor

@wesm, I probably won't be able to get to this for at least a few weeks. @liyafan82 or @praveenbingo is this of interest to either of you?

@liyafan82
Copy link
Contributor

@wesm, I probably won't be able to get to this for at least a few weeks. @liyafan82 or @praveenbingo is this of interest to either of you?

@emkornfield

I am interested in it, and I would like to take a look. However, I may need some time to get familiar with the related code and the background.

@praveenbingo
Copy link
Contributor

@wesm, I probably won't be able to get to this for at least a few weeks. @liyafan82 or @praveenbingo is this of interest to either of you?

@emkornfield I can dedicate some review/help bandwidth, but bit stuck for the next few weeks with an internal release.

@emkornfield
Copy link
Contributor

Thanks @liyafan82 if I understand the rough scope of change it might be fairly large, if that is the case, please discuss any proposed solution on the mailing list before getting to far into coding to make sure we can get consensus on the approach.

@liyafan82
Copy link
Contributor

Thanks @liyafan82 if I understand the rough scope of change it might be fairly large, if that is the case, please discuss any proposed solution on the mailing list before getting to far into coding to make sure we can get consensus on the approach.

Sure. Sounds reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a style problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I wonder if this PR is old enough that we didn't have checkstyle enforcement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield , I have prepared a doc for the dictionary related use cases.

This PR solves problem 2 in the doc (misleading constructor), and it is a relatively small problem and involves small changes.

Can we first merge this PR and close this issue? It looks to me except some style problems, and some comments need to be removed.

@wesm
Copy link
Member

wesm commented Jul 17, 2019

Closing this for now, let's continue the discussion about dictionary-encoded data in Java in the mailing list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants