
Conversation


@philastrophist philastrophist commented Jun 20, 2025

This change adds support for generating numpy.ndarray and pandas.Series with any Python object as an element.
Effectively, hypothesis can now generate np.array([MyObject()], dtype=object).
The first use case for this is with Pandas and Pandera, where it is possible, and sometimes required, to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple, etc.

  • Accept dtype.kind = 'O' in from_dtype
  • Add the base case of any type
  • Use .iat instead of .iloc to set values in pandas strategies (this allows setting of dictionaries as elements etc)
  • Construct Series rather than setting elements in pandas strategies (this allows dictionaries as elements etc)
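
A minimal usage sketch of what this enables (MyObject stands in for any user-defined class, as in the description above; the alias names are illustrative, not from the PR):

import hypothesis.extra.numpy as npst
import hypothesis.extra.pandas as pdst
import hypothesis.strategies as st


class MyObject:
    pass


# Object-dtype containers whose elements are arbitrary Python objects.
arrays_of_objects = npst.arrays("O", shape=3, elements=st.builds(MyObject))
series_of_objects = pdst.series(dtype="O", elements=st.builds(MyObject))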

@Zac-HD Zac-HD requested a review from Liam-DeVoe June 26, 2025 02:08
Shaun Read added 3 commits July 2, 2025 14:46
@philastrophist (Author)

Some form of timeout error in CI


Zac-HD commented Jul 3, 2025

@tybug FAILED hypothesis-python/tests/watchdog/test_database.py::test_database_listener_directory_move - Exception: timing out after waiting 1s for condition lambda: set(events) on Windows CI

(I've hit retry, should be OK soon 🤞)

@Zac-HD Zac-HD (Member) left a comment

Thanks so much for your PR, Shaun!

This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.

@philastrophist (Author)

Some interesting error is occurring outside of the changes in this PR...

@philastrophist philastrophist requested a review from Zac-HD July 3, 2025 09:16
@Liam-DeVoe (Member)

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer.

@philastrophist (Author)

> sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

As far as I understand, at and iat are more basic indexers than loc and iloc, in that they can only access a single entry rather than possibly a subset of entries.
But ignoring vector access here, loc will transform dicts into a Series and then set them. There's an interesting note in their source here:

# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.

Seems to be vaguely related.

But the important points are:

  1. loc applies transformations to the given values, stopping us from inserting dicts into a Series using iloc/loc. This may or may not be a bug. Either way, editing this logic within pandas is likely to be fraught, and it's difficult to tell what other transforms might be applied.
  2. at is the intended way to set single values within a DataFrame/Series according to the docs. It's technically faster, but more importantly it doesn't perform any checks or transformations on the value, and the logic is a lot simpler. The reason ruff warns against it is that "iloc is more idiomatic and versatile". We know that, in our use case, we will only ever be setting a Series element by integer index, which is exactly what iat is for.

From the docstrings:

DataFrame.iat : Access a single value for a row/column label pair by integer position(s).
DataFrame.iloc : Access a group of rows and columns by integer position(s).
Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.

Demonstration:

import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # object dtype so we don't get mismatch warnings

s.iloc[0] = {'a': 1}
print('series with iloc:\n', s)
print('entry type with iloc:', type(s.iloc[0]))

s.iat[0] = {'a': 1}
print('with iat:\n', s)
print('entry type with iat:', type(s.iat[0]))

prints out:

series with iloc:
 0    a    1
dtype: int64
1                      2
2                      3
dtype: object
entry type with iloc: <class 'pandas.core.series.Series'>
with iat:
 0    {'a': 1}
1           2
2           3
dtype: object
entry type with iat: <class 'dict'>

@philastrophist (Author)

When do you think we could merge this?

@Liam-DeVoe (Member)

I'll take a look today, thanks for your patience (and contribution!)

@Liam-DeVoe Liam-DeVoe (Member) left a comment

Looking good! I updated the changelog to be a bit more concise, and would like to improve our testing:

  • I'd like to see a test combining dtype="O" with a strategy that generates a custom (data)class, for both numpy and pandas
    • A test for combining custom objects and normal types in the same dtype="O" array/series would be nice as well

@philastrophist (Author)

I'm back again!
Could you clarify "custom (data)class, for both numpy and pandas"?

@Liam-DeVoe (Member)

> Could you clarify "custom (data)class, for both numpy and pandas"?

As in: I'd like to see a test which defines a class or dataclass A with a bunch of fields of different types, and passes elements=st.builds(A) to the pandas and numpy strategies which have newly-added support for dtype="O". Then check that you can pull out elements of type A from the pandas series or numpy array. I want to make sure that supplying complicated classes to dtype="O" is well supported!
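
A hedged sketch of that kind of test (the Record class and its fields are illustrative, not from the PR; st.builds infers field strategies from the type annotations):

from dataclasses import dataclass

import hypothesis.extra.numpy as npst
import hypothesis.extra.pandas as pdst
import hypothesis.strategies as st
from hypothesis import given


@dataclass
class Record:
    name: str
    count: int
    flag: bool


@given(npst.arrays("O", shape=5, elements=st.builds(Record)))
def test_numpy_object_array_of_dataclasses(arr):
    # Every element pulled back out of the array should still be a Record.
    assert all(isinstance(x, Record) for x in arr)


@given(pdst.series(dtype="O", elements=st.builds(Record)))
def test_pandas_object_series_of_dataclasses(s):
    assert all(isinstance(x, Record) for x in s)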

@philastrophist (Author)

Changes:

  • Sped up the hot path in numpy's set_element by skipping for non-object cases
  • Dropped using iat with pandas and instead construct the Series from a list (type errors are still raised by pandas), to avoid pandas coercing values we don't want it to coerce (much cleaner anyway; see the sketch below)
  • Made the tests a bit more sophisticated (checking exact parity between the elements that go into the pandas/numpy strategy and their values when accessed back out of those numpy/pandas containers)
  • Removed the overflow-check pre-filter, since overflow only happens when pandas errors and tries to display the erroring row using string.ljust
  • Removed assert_safe_equals, since we can just assert list equality
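
A minimal illustration of the construct-from-a-list approach (plain pandas, not the actual strategy code):

import pandas as pd

values = [{"a": 1}, {"b": 2}, {"c": 3}]

# Building the Series directly from a Python list sidesteps the item-assignment
# coercion shown earlier, so dict elements are stored as dicts.
s = pd.Series(values, dtype=object)
assert all(isinstance(x, dict) for x in s)

# pandas still raises if the values don't fit a non-object dtype, e.g.
# pd.Series(values, dtype="int64")  # raises, since the dicts can't be cast to int64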

@philastrophist philastrophist requested a review from Zac-HD August 27, 2025 18:04

Zac-HD commented Aug 27, 2025

(looks like you merged master mid-release-process, and that's where the conflicts are coming from)

@philastrophist (Author)

> (looks like you merged master mid-release-process, and that's where the conflicts are coming from)

Ok finally figured that out

@philastrophist (Author)

Can I get a review?

@philastrophist (Author)

> Can I get a review?

Bump

@Liam-DeVoe Liam-DeVoe (Member) left a comment

I've made several direct changes here, since by the time I was deep enough in the review to give actionable feedback it was less effort to do so myself. I have one comment about the pandas changes, and then I think this is close to being ready.

@Zac-HD Zac-HD (Member) left a comment

@Liam-DeVoe if you've got some time coming up, I think getting this in should be higher priority than the dropping-py39 cleanups - it's been slow because it's a big complicated subtle change but it'd be great to ship it!


Liam-DeVoe commented Oct 17, 2025

OK, I've spent a bit of time understanding the context around this pull.

  • we actually already support dtype="object". What we don't support is automatic inference of a strategy for dtype="object". That's the core of what this PR adds. nps.arrays("O", shape=(1,), elements=st.just(object())) works today (see the sketch at the end of this comment)
  • I believe the equality code in set_element is just for nice error messages, when numpy unexpectedly converts an element in an array due to dtypes. This can't happen in dtype="object", so we don't need to filter out un-equatable objects.
  • I believe the object-conversion behavior that the pandas changes here work around is a bug, which I've filed: BUG: assignment to Series.iloc with dtype="object" converts dictionary to Series pandas-dev/pandas#62723. I'm not a regular pandas user, so I may be mistaken. But not being able to store a dict in an object-dtype pandas Series was pretty surprising to me.

(this means the pandas-coercion behavior is possibly an unrelated latent bug uncovered by the tests in this pull? unsure yet.)
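
A minimal sketch of the distinction in the first bullet (the inference path is what this PR adds):

import hypothesis.extra.numpy as nps
import hypothesis.strategies as st

# Already worked before this PR: the caller supplies the elements strategy.
explicit = nps.arrays("O", shape=(1,), elements=st.just(object()))

# New with this PR: no elements strategy given, so from_dtype has to infer one
# for dtype.kind == "O" (falling back to st.from_type(object)).
inferred = nps.arrays("O", shape=(1,))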

@Liam-DeVoe Liam-DeVoe changed the title from 'Support for numpy.ndarray and pandas.Series with any python object as entry' to 'Automatically infer a strategy for dtype="object"' Oct 17, 2025
Comment on lines +216 to 218
elif dtype.kind == "O":
    return st.from_type(object)
else:

it's not actually clear to me whether we want st.from_type(object) or from_type(type).flatmap(st.from_type) here. Should we make the former simply register to the latter?
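
For context, a rough sketch of what the two options construct (which one from_dtype should use is the open question here):

import hypothesis.strategies as st

# Option 1: ask for instances of `object` directly.
option_1 = st.from_type(object)

# Option 2: draw a type first, then an instance of that type -- the pattern
# behind the "everything_except" recipe in the Hypothesis docs.
option_2 = st.from_type(type).flatmap(st.from_type)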

@Liam-DeVoe Liam-DeVoe force-pushed the allow_objects_in_numpy_arrays_and_pandas_series branch from 566aa1f to 1ec77f5 on October 18, 2025 16:14