GH-42218: [Java] MapVector cannot be loaded via IPC #43014
lidavidm
left a comment
Let's not add mutable global state
Also, we should test this via IPC and C Data
Would it be okay if we use
java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java
Outdated
This should only be if there are 0 children? If there's any other number of children presumably it should fail
Right, so it is better to check size == 0 and assign KEY_NAME in that case, and otherwise take the field from the struct. For the value, I think the most accurate check would be size < 2, because there could be no key defined yet, or there could be 1 child (once the key was added).
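The child-count checks discussed above can be sketched in isolation. This is a minimal illustration only, not the code in this PR: `MapChildNames`, `keyName`, and `valueName` are hypothetical names, the struct's children are modeled as a plain list of names rather than Arrow `Field` objects, and the defaults `"key"` and `"value"` are assumed to match `KEY_NAME` and `VALUE_NAME`. Rejecting more than two children is also an assumption, following the reviewer's remark that any other number of children should presumably fail.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the Arrow Java API. */
final class MapChildNames {
  // Assumed to mirror MapVector.KEY_NAME and MapVector.VALUE_NAME.
  static final String DEFAULT_KEY_NAME = "key";
  static final String DEFAULT_VALUE_NAME = "value";

  /** Returns the default key name when the struct has no children, else the name of child 0. */
  static String keyName(List<String> structChildNames) {
    checkSize(structChildNames);
    return structChildNames.isEmpty() ? DEFAULT_KEY_NAME : structChildNames.get(0);
  }

  /** Returns the default value name when fewer than 2 children exist, else the name of child 1. */
  static String valueName(List<String> structChildNames) {
    checkSize(structChildNames);
    return structChildNames.size() < 2 ? DEFAULT_VALUE_NAME : structChildNames.get(1);
  }

  // Assumption: a map's entries struct never has more than two children (key and value).
  private static void checkSize(List<String> names) {
    if (names.size() > 2) {
      throw new IllegalArgumentException(
          "Map entries struct must have at most 2 children, got " + names.size());
    }
  }
}
```

With this shape, a struct that only has its key child so far still resolves the value name to the default, which is the case the `size < 2` check is meant to cover.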
java/vector/src/test/java/org/apache/arrow/vector/TestMapVector.java
Outdated
I'd like to see IPC and C Data integration tests with other languages, not within Java
@lidavidm this was an old issue filed for this purpose and we have a C Data interface test being skipped because of that. https://github.com/apache/arrow/issues/24869
I have enabled the non_canonical_map tests in archery integration tests. But I am getting one failure as shown below.
RuntimeError: Command failed: java -Dio.netty.tryReflectionSetAccessible=true -Darrow.struct.conflict.policy=CONFLICT_APPEND -XX:-UsePerfData --add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED --add-reads=org.apache.arrow.flight.core=ALL-UNNAMED -cp /home/vibhatha/github/fork/arrow/java/tools/target/arrow-tools-17.0.0-SNAPSHOT-jar-with-dependencies.jar org.apache.arrow.tools.Integration -a /tmp/tmpj67jyv0g/8bbfe7eb_generated_map_non_canonical.consumer_stream_as_file -j /tmp/arrow-integration-7d6i1uz7/generated_map_non_canonical.json -c VALIDATE
With output:
--------------
WARNING: Unknown module: org.apache.arrow.flight.core specified to --add-reads
WARNING: Unknown module: org.apache.arrow.memory.core specified to --add-opens
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
Incompatible files
Different schemas:
Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>
Schema<map_other_names: Map(false)<some_entries: Struct<some_key: Utf8 not null, some_value: Int(32, true)> not null>>
I checked two things:
- The schema of the Arrow file is
Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>
I verified this by reading the schema of the loaded file.
- I independently checked whether the Java JsonReader/Writer can write custom key and value fields.
It can write and read custom fields, which I think is why the Java-producing, Java-consuming case passes.
The Java-producing, C++-consuming tests also pass.
But with C++ producing and Java consuming, I get the above error. As noted in the first check, the schema of that file is
Schema<map_other_names: Map(false)<entries: Struct<key: Utf8 not null, value: Int(32, true)> not null>>.
So it seems C++ is not producing the custom field names? Am I missing something here?
# Canonical map names are restored on import, so the schemas are unequal
.skip_format(SKIP_C_SCHEMA, 'C++')
.skip_format(SKIP_C_SCHEMA, 'Java'),
There's a comment presumably referencing this. Maybe @pitrou can clarify?
I added the Java line, but it still fails. @pitrou, I'd appreciate your feedback.
C++ only compares map field names if the check_metadata option is enabled (it's disabled by default). Java should probably do something similar.
arrow/cpp/src/arrow/compare.cc
Lines 791 to 796 in 8fc40fc
if (check_metadata_ && (left.item_field()->name() != right.item_field()->name() ||
                        left.key_field()->name() != right.key_field()->name() ||
                        left.value_field()->name() != right.value_field()->name())) {
  result_ = false;
  return Status::OK();
}
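To make the suggestion concrete, here is a rough Java sketch of a schema comparison that ignores the names of a map's entries, key, and value fields unless a flag asks for them, analogous to the `check_metadata` behavior quoted above. It is not the Arrow Java `Validator` code: `SchemaCompareSketch`, `Node`, `equal`, and `mapField` are hypothetical names, and a field is modeled as a simple (name, type tag, nullability, children) tuple instead of Arrow's `Field`.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; the real comparison would operate on Arrow Field objects. */
final class SchemaCompareSketch {
  /** Minimal stand-in for a schema field. */
  static final class Node {
    final String name;
    final String type;
    final boolean nullable;
    final List<Node> children;

    Node(String name, String type, boolean nullable, Node... children) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
      this.children = Arrays.asList(children);
    }
  }

  /** Compares two fields; map-internal names are compared only if checkMapNames is true. */
  static boolean equal(Node a, Node b, boolean checkMapNames) {
    return equal(a, b, checkMapNames, 0);
  }

  // mapDepth: 0 = ordinary field, 1 = a map's entries struct, 2 = its key or value field.
  private static boolean equal(Node a, Node b, boolean checkMapNames, int mapDepth) {
    boolean ignoreName = !checkMapNames && mapDepth > 0;
    if (!ignoreName && !a.name.equals(b.name)) {
      return false;
    }
    if (!a.type.equals(b.type) || a.nullable != b.nullable) {
      return false;
    }
    if (a.children.size() != b.children.size()) {
      return false;
    }
    // Children of a map are its entries struct; children of that struct are key and value.
    int childDepth = a.type.equals("map") ? 1 : (mapDepth == 1 ? 2 : 0);
    for (int i = 0; i < a.children.size(); i++) {
      if (!equal(a.children.get(i), b.children.get(i), checkMapNames, childDepth)) {
        return false;
      }
    }
    return true;
  }

  /** Builds the map field from the failing test with the given entries, key, and value names. */
  static Node mapField(String entries, String key, String value) {
    return new Node("map_other_names", "map", true,
        new Node(entries, "struct", false,
            new Node(key, "utf8", false),
            new Node(value, "int32", true)));
  }
}
```

Under these assumptions, `equal(mapField("entries", "key", "value"), mapField("some_entries", "some_key", "some_value"), false)` treats the two schemas from the failure as equal, while passing `true` reports them as different.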
Thanks @pitrou for investigating this. I will take a look.
Looking into this: Java uses the Integration.java program, invoked by archery, to do this validation. That program compares the schemas taken from the generated JSON file and the Arrow file. So this should fail anyway, right? The generated JSON file has one schema and the IPC file has another, since the data written by C++ has the canonical names rather than the custom ones.
JSON Schema
"schema": {
  "fields": [
    {
      "name": "map_other_names",
      "type": {
        "name": "map",
        "keysSorted": false
      },
      "nullable": true,
      "children": [
        {
          "name": "some_entries",
          "type": {
            "name": "struct"
          },
          "nullable": false,
          "children": [
            {
              "name": "some_key",
              "type": {
                "name": "utf8"
              },
              "nullable": false,
              "children": []
            },
            {
              "name": "some_value",
              "type": {
                "name": "int",
                "isSigned": true,
                "bitWidth": 32
              },
              "nullable": true,
              "children": []
            }
          ]
        }
      ]
    }
  ]
},
IPC Content
>>> import pyarrow as pa
>>> file_path = "/tmp/tmppa0h4f3m/7a269ab8_generated_map_non_canonical.consumer_stream_as_file"
>>> with pa.OSFile(file_path, 'rb') as source:
... loaded_array = pa.ipc.open_file(source).read_all()
...
>>> loaded_array
pyarrow.Table
map_other_names: map<string, int32>
child 0, entries: struct<key: string not null, value: int32> not null
child 0, key: string not null
child 1, value: int32
----
map_other_names: [[keys:["mw3gônj"]values:[null],null,keys:["矢gc6h43","4n矢3°5€"]values:[2147483647,-99106826],keys:["or3iµg£","naadofp","°dfhjrl","hôr£µn£"]values:[1403778175,401427101,null,1171070133],keys:["i2nk4oô"]values:[1754612069],keys:["wiôgid6","dÂom23r"]values:[1528772736,-696511786],keys:["rdÂ3n1r","wµrôÂfc","kfakmko","°2Â3hÂ"]values:[null,1836582554,-1502317905,153924375]]]
cc @pitrou
@vibhatha I'll repeat what I said above: C++ only compares map field names if the check_metadata option is enabled (it's disabled by default). Java should probably do something similar.
Note that Integration.java calls Validator.compareSchemas...
Could you open a new PR in apache/arrow-java?
Sure, I will look into it.
Rationale for this change
The MapVector keeps KEY_NAME and VALUE_NAME as constant values, but they should instead be extracted from the provided fields.
What changes are included in this PR?
Updating KEY_NAME and VALUE_NAME from the provided fields. Adding a test case to validate that.
Are these changes tested?
Yes. A test case has been added to validate this.
Are there any user-facing changes?
No