Conversation

@wesm (Member) commented Sep 28, 2017

This unifies the ingest path for 1D data into pyarrow.array. I added the argument from_pandas to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, ingest is always zero-copy, but when `from_pandas=True` and no null mask is passed, a null bitmap is constructed and populated.
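As a rough sketch of what that null-sentinel pass implies (this is illustrative NumPy code, not Arrow's actual C++ implementation), the converter scans for NaN and packs a validity bitmap where a set bit means "valid", which is Arrow's convention:

```python
import numpy as np

def validity_bitmap(values):
    """Illustrative sketch: derive an Arrow-style validity bitmap
    (1 bit per element, set bit == valid, LSB-first) from NaN sentinels."""
    valid = ~np.isnan(values)                      # True where non-null
    bits = np.packbits(valid, bitorder='little')   # 8 elements per byte
    null_count = int((~valid).sum())
    return bits.tobytes(), null_count

arr = np.array([1.0, np.nan, 3.0, np.nan])
bitmap, nulls = validity_bitmap(arr)
print(nulls)      # 2
print(bitmap[0])  # 5 == 0b0101: elements 0 and 2 are valid
```

This per-element scan is why the `from_pandas=True` timing above is so much slower than the zero-copy path.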

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```
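A minimal sketch of how such a NumPy-style alias table can be wired up (the names here are illustrative; pyarrow's real mapping is internal and more complete):

```python
# Illustrative alias table in the spirit of the patch, not pyarrow's code.
ALIASES = {
    'i1': 'int8', 'i2': 'int16', 'i4': 'int32', 'i8': 'int64',
    'u1': 'uint8', 'u2': 'uint16', 'u4': 'uint32', 'u8': 'uint64',
    'f4': 'float32', 'f8': 'float64',
}

def resolve_type_name(name):
    """Return the canonical type name for either an alias or a full name."""
    return ALIASES.get(name, name)

print(resolve_type_name('i4'))     # int32
print(resolve_type_name('int32'))  # int32
```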

@wesm (Member Author) commented Sep 29, 2017

@cpcloud or @xhochy could you review when you have a chance?

```cpp
class DateConverter : public TypedConverterVisitor<Date64Builder, DateConverter> {
 public:
  inline Status AppendItem(const OwnedRef& item) {
    PyDateTime_Date* pydate = reinterpret_cast<PyDateTime_Date*>(item.obj());
```
Contributor:

Could be auto.

```cpp
    PyDateTime_DateTime* pydatetime =
        reinterpret_cast<PyDateTime_DateTime*>(item.obj());
    return typed_builder_->Append(PyDateTime_to_us(pydatetime));
  }
```
Contributor:

Could be auto.


```cython
def array(object obj, type=None, mask=None,
          MemoryPool memory_pool=None, size=None,
          from_pandas=False):
```
Contributor:

It's not obvious to me how mask and from_pandas interact. If mask[i] == False, from_pandas == True, and obj[i] is NaN, what are the semantics?

Contributor:

After reading the C++ for NumPyConverter, it appears that mask takes priority over from_pandas. Is it possible that we just use mask to cover both cases and pass series.isnull() to mask if obj is a Series? This does allocate an additional array, though.

Member Author:

Yeah, that would harm performance. At least it should be better documented, let me do that
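Based on the reviewer's reading (an explicit mask takes priority over from_pandas), the precedence could be sketched like this; treat it as an illustration of the documented behavior, not pyarrow's actual code:

```python
import numpy as np

def effective_validity(values, mask=None, from_pandas=False):
    """Illustrative precedence: an explicit mask wins; otherwise
    from_pandas=True triggers NaN sentinel detection; otherwise
    every element is considered valid."""
    if mask is not None:
        return ~np.asarray(mask)       # mask marks nulls, so invert it
    if from_pandas:
        return ~np.isnan(values)       # sentinel checking
    return np.ones(len(values), dtype=bool)

vals = np.array([1.0, np.nan, 3.0])
# The explicit mask overrides the NaN sentinel at index 1:
print(effective_validity(vals, mask=[False, False, True], from_pandas=True))
# [ True  True False]
```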


```python
def test_type_for_alias():
    cases = [
        ('i1', pa.int8()),
```
Contributor:

I personally am not a huge fan of the i* and u* numpy aliases because they aren't very explicit and they use different units than their more verbose counterparts. NumPy also has u8 and U8 which mean completely different things. I guess for maximum compatibility they are useful, but not for much else.

Member Author:

Agreed. We should use the longhand versions in any real code.
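The u8/U8 footgun the reviewer mentions is easy to demonstrate with NumPy itself:

```python
import numpy as np

# 'u8' is an unsigned 64-bit integer (8 bytes)...
assert np.dtype('u8') == np.dtype(np.uint64)

# ...while 'U8' is a Unicode string of up to 8 characters,
# stored as 4 bytes per character.
assert np.dtype('U8').kind == 'U'
assert np.dtype('U8').itemsize == 32
```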

```diff
-def __richcmp__(DataType self, DataType other, int op):
+def __richcmp__(DataType self, object other, int op):
+    cdef DataType other_type
+    if not isinstance(other, DataType):
```
Contributor:

This should probably check for basestring/str only and raise otherwise to prevent any weird red herring errors from comparisons returning a meaningless False value.

Member Author:

Done.
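In plain Python terms, the suggested behavior amounts to something like this toy `__eq__` (pyarrow's actual implementation lives in Cython's `__richcmp__`, and this class and its alias subset are purely illustrative):

```python
class ToyDataType:
    """Toy stand-in illustrating the suggested semantics: compare against
    another type or a string alias, and raise for anything else rather than
    silently returning a meaningless False."""
    ALIASES = {'i4': 'int32'}  # illustrative subset, not pyarrow's table

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, ToyDataType):
            return self.name == other.name
        if isinstance(other, str):
            return self.name == self.ALIASES.get(other, other)
        raise TypeError(
            'cannot compare ToyDataType to {}'.format(type(other).__name__))

assert ToyDataType('int32') == 'i4'   # alias comparison works
# ToyDataType('int32') == 42 would raise TypeError instead of returning False
```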

```python
# specific language governing permissions and limitations
# under the License.

import re
```
Contributor:

This doesn't appear to be used anywhere.

Member Author:

Removed.

@wesm (Member Author) commented Sep 30, 2017

Thanks for the review. I'll merge once the build clears

@wesm (Member Author) commented Sep 30, 2017

+1

@asfgit asfgit closed this in 796129b Sep 30, 2017
@wesm wesm deleted the expand-py-array-method branch September 30, 2017 04:15
wesm added a commit to wesm/arrow that referenced this pull request Oct 3, 2017
…riginating in pandas

This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, ingest is always zero-copy, but when `from_pandas=True` and no null mask is passed, a null bitmap is constructed and populated.

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```

Author: Wes McKinney <[email protected]>

Closes apache#1146 from wesm/expand-py-array-method and squashes the following commits:

1570e52 [Wes McKinney] Code review comments
d3bbb3c [Wes McKinney] Handle type aliases in cast, too
797f015 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575 [Wes McKinney] Add direct types sequence converters for more data types
cf40b76 [Wes McKinney] Add type aliases, some unit tests
7b530e4 [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array