Closed
Description
import pyarrow as pa
import pyarrow.compute  # pa.compute is a submodule; import it explicitly

arr = pa.array(['A'] * 16)
arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
arr2.validate(full=True)

Expected results: a valid array
Actual results: pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63
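For readers unfamiliar with the invariant named in that error: an Arrow string array stores an offsets buffer with one more entry than the array has values, and value i spans the data bytes between offsets[i] and offsets[i + 1], so the offsets must never decrease. The following is a minimal pure-Python sketch of that check, not Arrow's actual validation code. The helper name `first_non_monotonic_slot` and the "broken" offsets are hypothetical, chosen only to illustrate the kind of failure being reported.

```python
from typing import Optional, Sequence


def first_non_monotonic_slot(offsets: Sequence[int]) -> Optional[int]:
    """Return the first slot whose offset is smaller than the previous one.

    Returns None when the offsets never decrease, i.e. the invariant holds.
    """
    for slot in range(1, len(offsets)):
        if offsets[slot] < offsets[slot - 1]:
            return slot
    return None


# 16 one-byte strings: the offsets run 0, 1, ..., 16 (17 entries).
valid_offsets = list(range(17))

# Hypothetical corruption: the final offset was left at 0 instead of 16.
broken_offsets = list(range(16)) + [0]

print(first_non_monotonic_slot(valid_offsets))
print(first_non_monotonic_slot(broken_offsets))
```

An array whose offsets fail this check can make downstream code compute a negative or enormous string length, which is consistent with the kind of crash described next.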
So if you run arr.diff(arr2), you'll get something like:
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)

This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:
def replace_substring_regex_workaround_12774(
    array: pa.Array,
    *,
    pattern: str,
    replacement: str
) -> pa.Array:
    if len(array) > 0 and len(array) % 16 == 0:
        # Split into two chunks so that neither chunk's length is a
        # multiple of 16, then stitch the results back together.
        chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)
        return pa.compute.replace_substring_regex(
            chunked_array,
            pattern=pattern,
            replacement=replacement
        ).combine_chunks()
    else:
        return pa.compute.replace_substring_regex(
            array,
            pattern=pattern,
            replacement=replacement
        )

Reporter: Adam Hooper / @adamhooper
Assignee: Niranda Perera / @nirandaperera
Related issues:
- [Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/UB (supersedes)
PRs and other links:
Note: This issue was originally created as ARROW-12774. Please see the migration documentation for further details.