Skip to content

Conversation

@icexelloss
Copy link
Contributor

@icexelloss icexelloss commented Oct 19, 2018

Per discussion on #2681 (comment)

@icexelloss icexelloss changed the title Clarify the type of dictonary encoded field ARROW-3566: Clarify the type of dictonary encoded field Oct 19, 2018
@icexelloss
Copy link
Contributor Author

cc @wesm @elahrvivaz

@elahrvivaz
Copy link
Contributor

@wesm I assume the C++ implementation encodes dictionary field types the same way, otherwise we wouldn't be passing the dictionary vector integration tests

// Name is not required, in i.e. a List
name: string;
nullable: bool;
// This is the type of the index if the field is dictionary encoded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right. In Java, when a dictionary Field is written it first calls DictionaryUtility.toMessageFormat here which changes the Field to be the dictionary type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, that was the bridge to fix the gap between the java API and the IPC format. If I remember now, people didn't want to embed dictionary-related logic throughout the codebase, which would be required if the internal java field type has the dictionary type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I have checked the integration tests and it seems @BryanCutler is correct, the type here is the type of decoded type, not the index type:

This is taken from the generated json file in integration test

      {
        "name": "dict1_0",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 8
          },
          "isOrdered": false
        }
      },
      ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm Could you please help confirm what the proper type should be here?

@icexelloss
Copy link
Contributor Author

Ok I have checked the integration tests and it seems the type here is the type of decoded type, not the index type:

This is taken from the generated json file in integration test:

      {
        "name": "dict1_0",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 8
          },
          "isOrdered": false
        }
      },
      ...

@wesm Would you please help clarify what the correct type is here?

@wesm
Copy link
Member

wesm commented Oct 24, 2018

In C++ we have a synthetic DictionaryType that holds the dictionary / dictionary type as well as the index type. In the IPC metadata, the field type is that of the decoded type. The DictionaryEncoding table holds the index type

@icexelloss
Copy link
Contributor Author

@wesm Thanks for the clarification. Then I think this PR should be good to go.

I will need to go back to #2681 to see what to do there. Thanks all !

@wesm wesm changed the title ARROW-3566: Clarify the type of dictonary encoded field ARROW-3566: [Format] Clarify the type of dictonary encoded field Oct 30, 2018
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants