GH-40407: [JS] Fix string coercion in MapRowProxyHandler.ownKeys #40408
js/src/row/map.ts
Outdated
    return Array.from(row[kKeys].toArray(), String);
}
has(row: MapRow<K, V>, key: string | symbol) {
    return row[kKeys].includes(key);
(I’m not fixing this in this PR, but I suspect that there’s also a bug here: if key is a string, and row[kKeys] is a typed array, then includes(key) will always return false. You’d need to coerce the key to a number, or coerce row[kKeys] to strings, for this test to return true. The same goes for the other uses of row[kKeys].indexOf(key) below, I expect.)
@mbostock any reason not to fix it in this PR?
I’m not familiar enough with this codebase to do it confidently and to test. I’d welcome someone else doing it.
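The suspected has bug described in this thread can be reproduced without Arrow. The sketch below uses a plain Int32Array as a hypothetical stand-in for row[kKeys]; naiveHas and coercedHas are illustrative names, not functions from the codebase, and coercing the keys to strings is only one of the two options suggested above.

```typescript
// Hypothetical stand-in for row[kKeys] when the map keys are Int32.
const keys = new Int32Array([1, 2, 3]);

// Mirrors the reviewed lookup: the key arrives from a Proxy trap as a
// string (or symbol) and is compared to numbers without coercion.
function naiveHas(key: string | symbol): boolean {
    // Cast only so TypeScript accepts a non-number argument here.
    return (keys as unknown as { includes(v: unknown): boolean }).includes(key);
}

// One possible fix (an assumption, not the merged code): compare against
// the stringified keys so "2" matches the stored number 2.
function coercedHas(key: string | symbol): boolean {
    if (typeof key === "symbol") return false;
    return Array.from(keys, String).includes(key);
}

console.log(naiveHas("2"));   // false
console.log(coercedHas("2")); // true
console.log(coercedHas("4")); // false
```

Stringifying every key on each lookup is wasteful for large maps, so a real fix would likely cache the converted keys or coerce the incoming key instead.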

LGTM, thanks @mbostock!

Woot, thanks for the approval @trxcllnt. I’m excited to contribute. 😁

IIRC the Map keys are required to be strings (tho I'm not sure we prevent someone from using a different type). Or has newer Arrow loosened this requirement? That's the only reason I can think of that we'd have made the assumption the keys would always be strings here.

I personally prefer supporting maps with keys of any Arrow dtype, but I'm not sure how compatible with other implementations that will be. I do see the integration tests seem to only test maps with string keys.

Python (which uses C++, which could be considered the canonical implementation) only allows string keys.
Traceback (most recent call last):
File "/Users/dominik/Code/ramsch/map.py", line 8, in <module>
table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [{x: x} for x in range(10)]})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 5204, in pyarrow.lib.table
File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected dict key of type str or bytes, got 'int'
Rust doesn't seem to enforce string keys, and we couldn't find anything in the spec, so it seems to be a gray area right now.

Yeah, I'm inclined to tell DuckDB their Map implementation is non-conformant. Generally whatever Even if JS allows non-string keys, other implementations probably won't. This has been a significant source of confusion in the past, so we try to align with C++ as much as possible.
trxcllnt left a comment

It is absolutely not the case that maps must have string keys. As just one example, part of the definition of ADBC metadata includes a

You're right, I don't see C++ checking the keys type in

This is not how you would create a Map array with PyArrow. Example:
>>> ty = pa.map_(pa.int8(), pa.int32())
>>> a = pa.array([{1: 1000, 2: 10000}, {3: -1000}], type=ty)
>>> a
<pyarrow.lib.MapArray object at 0x7f7ab948dd80>
[
keys:
[
1,
2
]
values:
[
1000,
10000
],
keys:
[
3
]
values:
[
-1000
]
]

Oh,

Anything else I can do to help this land? Seems like we still want this fix.

Apologies for bumping, but I’d love to help move this forward as this is currently preventing us from using DuckDB’s

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 117460b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
The ProxyHandler.ownKeys implementation must return strings or symbols. Because of this bug, it was returning numbers, causing the in operator to crash when trying to iterate over the keys of a MapRow object.
An example of this is a DuckDB SQL query using the HISTOGRAM operator.
What changes are included in this PR?
Instead of calling array.map(String), which returns a typed array of non-strings when array is a typed array, call Array.from, which is guaranteed to return strings.
Are these changes tested?
Apologies, but I don’t know how to test this.
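One way this could be exercised in isolation is with a small stand-in proxy rather than the real MapRow. The sketch below is only that: keys, vals, makeRow, buggyKeys, fixedKeys and enumerationThrows are hypothetical names invented for illustration, and the proxy is far simpler than the handler in js/src/row/map.ts. It relies on two behaviors of the language itself: TypedArray.prototype.map always produces another typed array, and a Proxy's ownKeys trap throws a TypeError if its result contains anything other than strings or symbols.

```typescript
// Hypothetical stand-ins for a MapRow's key and value storage.
const keys = new Int32Array([1, 2, 3]);
const vals = ["a", "b", "c"];

// Pre-fix behavior: TypedArray.prototype.map writes results back into a
// typed array, so the strings are coerced to numbers again.
// (`any` because Int32Array's map signature rejects String in TypeScript.)
const buggyKeys = (arr: any): any => arr.map(String);

// Post-fix behavior: Array.from yields a plain array of strings.
const fixedKeys = (arr: Int32Array): string[] => Array.from(arr, String);

// Minimal proxy exposing the keys as own enumerable properties.
function makeRow(listKeys: (arr: Int32Array) => ArrayLike<string | symbol>): object {
    return new Proxy({}, {
        ownKeys: () => listKeys(keys),
        getOwnPropertyDescriptor: (_target, key) => {
            const i = Array.from(keys, String).indexOf(String(key));
            return i < 0
                ? undefined
                : { value: vals[i], enumerable: true, configurable: true };
        },
    });
}

// Returns true when enumerating the row's keys throws a TypeError.
function enumerationThrows(row: object): boolean {
    try {
        Object.keys(row);
        return false;
    } catch (err) {
        return err instanceof TypeError;
    }
}

console.log(enumerationThrows(makeRow(buggyKeys))); // true
console.log(Object.keys(makeRow(fixedKeys)));       // ["1", "2", "3"]
```

A regression test in the actual test suite would presumably build a Map vector with integer keys and assert that Object.keys on one of its rows returns strings, but the exact builders to use there are not covered in this thread.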
Are there any user-facing changes?
This fixes a crash when using the in operator on a MapRow object.