Conversation

@wesm (Member) commented Sep 28, 2017

This unifies the ingest path for 1D data into pyarrow.array. I added the argument from_pandas to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, ingest is always zero-copy, but when `from_pandas=True` and no null mask is passed, a null bitmap is constructed and populated.
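As a rough sketch of what that null-sentinel pass implies (this is illustrative NumPy code, not Arrow's actual C++ implementation), the converter scans for NaN and packs a validity bitmap where a set bit means "valid", which is Arrow's convention:

```python
import numpy as np

def validity_bitmap(values):
    """Illustrative sketch: derive an Arrow-style validity bitmap
    (1 bit per element, set bit == valid, LSB-first) from NaN sentinels."""
    valid = ~np.isnan(values)                      # True where non-null
    bits = np.packbits(valid, bitorder='little')   # 8 elements per byte
    null_count = int((~valid).sum())
    return bits.tobytes(), null_count

arr = np.array([1.0, np.nan, 3.0, np.nan])
bitmap, nulls = validity_bitmap(arr)
print(nulls)      # 2
print(bitmap[0])  # 5 == 0b0101: elements 0 and 2 are valid
```

This per-element scan is why the `from_pandas=True` timing above is so much slower than the zero-copy path.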

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```
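A minimal sketch of how such a NumPy-style alias table can be wired up (the names here are illustrative; pyarrow's real mapping is internal and more complete):

```python
# Illustrative alias table in the spirit of the patch, not pyarrow's code.
ALIASES = {
    'i1': 'int8', 'i2': 'int16', 'i4': 'int32', 'i8': 'int64',
    'u1': 'uint8', 'u2': 'uint16', 'u4': 'uint32', 'u8': 'uint64',
    'f4': 'float32', 'f8': 'float64',
}

def resolve_type_name(name):
    """Return the canonical type name for either an alias or a full name."""
    return ALIASES.get(name, name)

print(resolve_type_name('i4'))     # int32
print(resolve_type_name('int32'))  # int32
```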

@wesm (Member Author) commented Sep 29, 2017

@cpcloud or @xhochy could you review when you have a chance?

```cpp
class DateConverter : public TypedConverterVisitor<Date64Builder, DateConverter> {
 public:
  inline Status AppendItem(const OwnedRef& item) {
    PyDateTime_Date* pydate = reinterpret_cast<PyDateTime_Date*>(item.obj());
```
Contributor:

Could be auto.

```cpp
    PyDateTime_DateTime* pydatetime =
        reinterpret_cast<PyDateTime_DateTime*>(item.obj());
    return typed_builder_->Append(PyDateTime_to_us(pydatetime));
  }
```
Contributor:

Could be auto.


```cython
def array(object obj, type=None, mask=None,
          MemoryPool memory_pool=None, size=None,
          from_pandas=False):
```
Contributor:

It's not obvious to me how mask and from_pandas interact. If mask[i] == False, from_pandas == True, and obj[i] is NaN, what are the semantics?

Contributor:

After reading the C++ for NumPyConverter, it appears that mask takes priority over from_pandas. Is it possible that we just use mask to cover both cases and pass series.isnull() to mask if obj is a Series? This does allocate an additional array, though.

Member Author:

Yeah, that would harm performance. At least it should be better documented, let me do that
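Based on the reviewer's reading (an explicit mask takes priority over from_pandas), the precedence could be sketched like this; treat it as an illustration of the documented behavior, not pyarrow's actual code:

```python
import numpy as np

def effective_validity(values, mask=None, from_pandas=False):
    """Illustrative precedence: an explicit mask wins; otherwise
    from_pandas=True triggers NaN sentinel detection; otherwise
    every element is considered valid."""
    if mask is not None:
        return ~np.asarray(mask)       # mask marks nulls, so invert it
    if from_pandas:
        return ~np.isnan(values)       # sentinel checking
    return np.ones(len(values), dtype=bool)

vals = np.array([1.0, np.nan, 3.0])
# The explicit mask overrides the NaN sentinel at index 1:
print(effective_validity(vals, mask=[False, False, True], from_pandas=True))
# [ True  True False]
```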


```python
def test_type_for_alias():
    cases = [
        ('i1', pa.int8()),
```
Contributor:

I personally am not a huge fan of the i* and u* numpy aliases because they aren't very explicit and they use different units than their more verbose counterparts. NumPy also has u8 and U8 which mean completely different things. I guess for maximum compatibility they are useful, but not for much else.

Member Author:

Agreed. We should use the longhand versions in any real code.
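The u8/U8 footgun the reviewer mentions is easy to demonstrate with NumPy itself:

```python
import numpy as np

# 'u8' is an unsigned 64-bit integer (8 bytes)...
assert np.dtype('u8') == np.dtype(np.uint64)

# ...while 'U8' is a Unicode string of up to 8 characters,
# stored as 4 bytes per character.
assert np.dtype('U8').kind == 'U'
assert np.dtype('U8').itemsize == 32
```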

```diff
-def __richcmp__(DataType self, DataType other, int op):
+def __richcmp__(DataType self, object other, int op):
+    cdef DataType other_type
+    if not isinstance(other, DataType):
```
Contributor:

This should probably check for basestring/str only and raise otherwise to prevent any weird red herring errors from comparisons returning a meaningless False value.

Member Author:

Done.
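In plain Python terms, the suggested behavior amounts to something like this toy `__eq__` (pyarrow's actual implementation lives in Cython's `__richcmp__`, and this class and its alias subset are purely illustrative):

```python
class ToyDataType:
    """Toy stand-in illustrating the suggested semantics: compare against
    another type or a string alias, and raise for anything else rather than
    silently returning a meaningless False."""
    ALIASES = {'i4': 'int32'}  # illustrative subset, not pyarrow's table

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, ToyDataType):
            return self.name == other.name
        if isinstance(other, str):
            return self.name == self.ALIASES.get(other, other)
        raise TypeError(
            'cannot compare ToyDataType to {}'.format(type(other).__name__))

assert ToyDataType('int32') == 'i4'   # alias comparison works
# ToyDataType('int32') == 42 would raise TypeError instead of returning False
```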

```python
# specific language governing permissions and limitations
# under the License.

import re
```
Contributor:

This doesn't appear to be used anywhere.

Member Author:

Removed.

@wesm (Member Author) commented Sep 30, 2017

Thanks for the review. I'll merge once the build clears

@wesm (Member Author) commented Sep 30, 2017

+1

@asfgit asfgit closed this in 796129b Sep 30, 2017
@wesm wesm deleted the expand-py-array-method branch September 30, 2017 04:15
wesm added a commit to wesm/arrow that referenced this pull request Oct 3, 2017
…riginating in pandas

This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:

```
In [8]: arr = np.random.randn(10000000)

In [9]: arr[::3] = np.nan

In [10]: arr2 = pa.array(arr)

In [11]: arr2.null_count
Out[11]: 0

In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop

In [13]: arr2 = pa.array(arr, from_pandas=True)

In [14]: arr2.null_count
Out[14]: 3333334

In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```

When the data is contiguous, ingest is always zero-copy, but when `from_pandas=True` and no null mask is passed, a null bitmap is constructed and populated.

This also permits sequence reads into integers smaller than int64:

```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
  1,
  2,
  3,
  4
]
```

Oh, I also added NumPy-like string type aliases:

```
In [18]: pa.int32() == 'i4'
Out[18]: True
```

Author: Wes McKinney <[email protected]>

Closes apache#1146 from wesm/expand-py-array-method and squashes the following commits:

1570e52 [Wes McKinney] Code review comments
d3bbb3c [Wes McKinney] Handle type aliases in cast, too
797f015 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575 [Wes McKinney] Add direct types sequence converters for more data types
cf40b76 [Wes McKinney] Add type aliases, some unit tests
7b530e4 [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array