Skip to content

Conversation

@arw2019
Copy link
Contributor

@arw2019 arw2019 commented Aug 25, 2020

This PR homogenizes error messages for mixed-type Pandas inputs to pa.Table.

The message for Pandas column with int followed by string is now

In [2]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19, 'a']}))                                                                                                                                             
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to int', 'Conversion failed for column a with type object')

the same as for double followed by string:

In [3]: table = pa.Table.from_pandas(pd.DataFrame({'a': [ 19.0, 'a']}))                                                                                                                                             
(... traceback...)
ArrowInvalid: ('Could not convert a with type str: tried to convert to double', 'Conversion failed for column a with type object')

As a side effect, this snippet [xref #5866, ARROW-7168] now throws an ArrowInvalid (has been FutureWarning since 0.16):

In [8]: cat = pd.Categorical.from_codes(np.array([0, 1], dtype='int8'), np.array(['a', 'b'], dtype=object)) 
   ...: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64()) 
   ...: result = pa.array(cat, type=typ)                                                                                                                                                                          
(... traceback...)
ArrowInvalid: Could not convert a with type str: tried to convert to int

Finally, this does break a test [xref #4484, ARROW-4036] - see code comment

@arw2019 arw2019 marked this pull request as draft August 25, 2020 04:31
@github-actions
Copy link

Copy link
Contributor Author

@arw2019 arw2019 Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We lose the the more specific traceback and ZeroDivisionError message, in favor of

In [11]: class MyBrokenInt: 
    ...:     def __init__(self): 
    ...:         1/0
In [12]: pa.array([MyBrokenInt()], type=pa.int64())        
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-12-1cf156b165b3> in <module>
----> 1 pa.array([MyBrokenInt()], type=pa.int64())

~/git_repo/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    269     else:
    270         # ConvertPySequence does strict conversion if type is explicitly passed
--> 271         return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
    272 
    273 

~/git_repo/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     38 
     39     with nogil:
---> 40         check_status(ConvertPySequence(sequence, mask, options, &out))
     41 
     42     if out.get().num_chunks() == 1:

~/git_repo/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     82 
     83         if status.IsInvalid():
---> 84             raise ArrowInvalid(message)
     85         elif status.IsIOError():
     86             # Note: OSError constructor is
ArrowInvalid: Could not convert <__main__.MyBrokenInt object at 0x7fc331394290> with type MyBrokenInt: tried to convert to int

but this is the same message as what we get on master for

In [11]: class MyBrokenInt: 
    ...:     def __init__(self): 
    ...:         1/1 

so maybe it's ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is fine, personally

@arw2019 arw2019 marked this pull request as ready for review August 25, 2020 18:22
@arw2019 arw2019 force-pushed the ARROW-7663 branch 2 times, most recently from 9a73767 to 21166d3 Compare August 26, 2020 05:06
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did the error message say before, and what does it show now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master it's

TypeError: an integer is required (got type pyarrow.lib.Int8Array)

verus on this branch

ArrowInvalid: Could not convert [
  5
] with type pyarrow.lib.Int8Array: tried to convert to int

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for this case I find the original error message clearer ..

That's the consequence of the scalar(..) conversion using the array conversion under the hood, I suppose?
But OK, I suppose this is fine (it's maybe mainly the multiline repr of the array in the middle of the sentence that makes it more confusing)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is fine, personally

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the cases that this couldn't be converted, but that obj is an integer? When the integer is too big to fit in a C int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and also when converting a negative integer to a uint:

pa.scalar(-1, type='uint8')

No other tests are touched if I recompile without this check

@arw2019 arw2019 force-pushed the ARROW-7663 branch 3 times, most recently from 716bd51 to 0735885 Compare August 31, 2020 18:53
@emkornfield
Copy link
Contributor

@jorisvandenbossche was there more to be done here?

@jorisvandenbossche
Copy link
Member

Thanks for the ping. I think all good. @arw2019 can you just rebase to ensure it's still all passing with latest master?

@arw2019
Copy link
Contributor Author

arw2019 commented Sep 14, 2020

@jorisvandenbossche Rebased and seeing some failures. They're ones also popping up in other, unrelated, PRs, so not sure they're to do with this patch? I'm happy to investigate, though

@jorisvandenbossche
Copy link
Member

There are some known failures on Mac and Appveyor at the moment, so nothing to worry about for this PR.

@jorisvandenbossche
Copy link
Member

Thanks @arw2019 !

@arw2019
Copy link
Contributor Author

arw2019 commented Sep 15, 2020

Thanks @jorisvandenbossche for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants