-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-838: [Python] Expand pyarrow.array to handle NumPy arrays not originating in pandas #1146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…row.Array Change-Id: I97e785c7fd34540f2c6ba05cfaaef5b1fbf830f4
Change-Id: I17dee43549b04a06a190baf5d0996fab4d60301f
Change-Id: I484937c8eb23b96402ec6b1ec3d4342fa8dedbd4
Change-Id: I2ec4737119bf25c5f5a5ee0e760855d01daaa79b
…rray Change-Id: Ie8e87c3c529f4071e221f390b333ad702d247c8d
Change-Id: Ic2999fba7575bc80ce71121f050ca528636a106d
| class DateConverter : public TypedConverterVisitor<Date64Builder, DateConverter> { | ||
| public: | ||
| inline Status AppendItem(const OwnedRef& item) { | ||
| PyDateTime_Date* pydate = reinterpret_cast<PyDateTime_Date*>(item.obj()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be auto.
| reinterpret_cast<PyDateTime_DateTime*>(item.obj()); | ||
| return typed_builder_->Append(PyDateTime_to_us(pydatetime)); | ||
| } | ||
| PyDateTime_DateTime* pydatetime = reinterpret_cast<PyDateTime_DateTime*>(item.obj()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be auto.
|
|
||
| def array(object obj, type=None, mask=None, | ||
| MemoryPool memory_pool=None, size=None, | ||
| from_pandas=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me how mask and from_pandas interact. If mask[i] = False, from_pandas == True, and obj[i] is NaN what are the semantics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the C++ for NumPyConverter, it appears that mask takes priority over from_pandas. Is it possible that we just use mask to cover both cases and pass series.isnull() to mask if obj is a Series? This does allocate an additional array, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would harm performance. At least it should be better documented, let me do that
|
|
||
| def test_type_for_alias(): | ||
| cases = [ | ||
| ('i1', pa.int8()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally am not a huge fan of the i* and u* numpy aliases because they aren't very explicit and they use different units than their more verbose counterparts. NumPy also has u8 and U8 which mean completely different things. I guess for maximum compatibility they are useful, but not for much else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. We should use the longhand versions in any real code
| def __richcmp__(DataType self, DataType other, int op): | ||
| def __richcmp__(DataType self, object other, int op): | ||
| cdef DataType other_type | ||
| if not isinstance(other, DataType): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably check for basestring/str only and raise otherwise to prevent any weird red herring errors from comparisons returning a meaningless False value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/pyarrow/types.pxi
Outdated
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
| import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to be used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
Change-Id: I371b9a8c30a7deaad41ccb729a26983de9a39ee6
|
Thanks for the review. I'll merge once the build clears |
|
+1 |
…riginating in pandas This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off: ``` In [8]: arr = np.random.randn(10000000) In [9]: arr[::3] = np.nan In [10]: arr2 = pa.array(arr) In [11]: arr2.null_count Out[11]: 0 In [12]: %timeit arr2 = pa.array(arr) The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 68.4 µs per loop In [13]: arr2 = pa.array(arr, from_pandas=True) In [14]: arr2.null_count Out[14]: 3333334 In [15]: %timeit arr2 = pa.array(arr, from_pandas=True) 1 loop, best of 3: 228 ms per loop ``` When the data is contiguous, it is always zero-copy, but then `from_pandas=True` and no null mask is passed, then a null bitmap is constructed and populated. This also permits sequence reads into integers smaller than int64: ``` In [17]: pa.array([1, 2, 3, 4], type='i1') Out[17]: <pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8> [ 1, 2, 3, 4 ] ``` Oh, I also added NumPy-like string type aliases: ``` In [18]: pa.int32() == 'i4' Out[18]: True ``` Author: Wes McKinney <[email protected]> Closes apache#1146 from wesm/expand-py-array-method and squashes the following commits: 1570e52 [Wes McKinney] Code review comments d3bbb3c [Wes McKinney] Handle type aliases in cast, too 797f015 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array f2802fc [Wes McKinney] Cleaner codepath for numpy->arrow conversions 587c575 [Wes McKinney] Add direct types sequence converters for more data types cf40b76 [Wes McKinney] Add type aliases, some unit tests 7b530e4 [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array
This unifies the ingest path for 1D data into
pyarrow.array. I added the argumentfrom_pandasto turn null sentinel checking on or off:When the data is contiguous, it is always zero-copy, but then
from_pandas=Trueand no null mask is passed, then a null bitmap is constructed and populated.This also permits sequence reads into integers smaller than int64:
Oh, I also added NumPy-like string type aliases: