[BLOG]: Supporting dask arrays in scipy via the Python Array API standard #904
Not done yet, but just wanted to put up a quick draft of my blog post. The main omission is probably the case study section where I port a scipy.stats workflow to dask arrays and look at performance. All other sections are basically complete.
This looks like a good start @lithomas1, thanks. The flow of the story at a high level looks good to me. The case study section is important indeed; the post is still pretty draft now so I didn't review in detail.
@rgommers This should be ready for a look now.
@lithomas1 Thank you! This is a good read!
I've shared some suggestions, mainly around phrasing :)
---
title: 'Supporting dask arrays in scipy via the Python Array API standard'
authors: [thomas-li]
published: May 26, 2025
Just noting that we'll need to update this before merging. :)
In this post, I describe my journey getting SciPy to work with Dask arrays natively via the array API and the current
limitations and future outlook.
Suggested change:
In this post, I describe my journey getting SciPy to work with Dask Arrays natively via the Array API standard,
and discuss the current limitations and future outlook of this work.
## Introduction: A quick refresher of the Python Array API standard

For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
Suggested change:
The [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
I think it'll be nicer to start with a simpler statement.
For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
is a specification aimed at unifying the various APIs of different array libraries (e.g. Numpy, PyTorch, JAX, Dask, etc.).
Suggested change:
is a specification aimed at unifying the various APIs of different array libraries (e.g. NumPy, PyTorch, JAX, Dask, etc.).
I see there are several different capitalizations for various libraries throughout the blog. Could you please do a find and replace to have one style for all?
I think these are the capitalizations: NumPy, SciPy, Dask, PyTorch, pandas, JAX, and CuPy, unless you're referring to the API, in which case it's all lowercase and presented as inline code like `dask.array`. I think this is already the case for the most part, but I noticed a few deviations here and there, hence the explicit comment. :)
users to treat arbitrary array objects as NumPy arrays via duck typing.
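
To make the duck-typing idea concrete, here is a minimal sketch of array-agnostic code. It is not taken from the post or from SciPy: `ToyArray`, `ToyNamespace`, and `root_mean_square` are hypothetical names invented for illustration, and the toy backend implements only the handful of operations the example needs. The one piece that comes from the standard is the `__array_namespace__` method, which returns the namespace of functions belonging to whichever library owns the array.

```python
import math


class ToyArray:
    """Hypothetical list-backed array, standing in for a real array library."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __array_namespace__(self, api_version=None):
        # Entry point defined by the standard: hand back the function namespace.
        return ToyNamespace

    def __mul__(self, other):
        # Elementwise product of two equally sized arrays.
        return ToyArray(a * b for a, b in zip(self.values, other.values))

    def __float__(self):
        # Only single-element results convert to a Python scalar.
        (value,) = self.values
        return value


class ToyNamespace:
    """Hypothetical namespace exposing two functions named as in the standard."""

    @staticmethod
    def mean(x):
        return ToyArray([sum(x.values) / len(x.values)])

    @staticmethod
    def sqrt(x):
        return ToyArray(math.sqrt(v) for v in x.values)


def root_mean_square(x):
    """Array-agnostic: never imports or names a specific array library."""
    xp = x.__array_namespace__()  # namespace of whichever library owns x
    return xp.sqrt(xp.mean(x * x))


rms = float(root_mean_square(ToyArray([3.0, 4.0])))
print(rms)  # approximately 3.54
```

The same `root_mean_square` body should work unchanged on any array type that implements `__array_namespace__` (NumPy arrays do from NumPy 2.0 onwards). For array types that do not define the method themselves, a compatibility layer such as array-api-compat offers an `array_namespace` helper that plays the same role.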
Today, [array api support](https://scipy.github.io/devdocs/dev/api-dev/array_api.html) in scipy has progressed a long
way since mid 2023 when array API support was first experimentally introduced within the libary. While the array API
Suggested change:
way since mid-2023 when array API support was first experimentally introduced within the library. While the array API
`*` - Some public API functions/methods in this module have not yet been ported to the Array API standard.
(Status refers to the status of dask.array with )
See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
Suggested change:
See the [SciPy developer docs](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
for a list of supported functions/methods.

As of today, the `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to
Suggested change:
The `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to
In the next section, we will take a look more closely at how array API compatibility enables better performance with
dask arrays within the `scipy.stats` module.

## Example
Really like this!
From this p-value, we can reject our null hypothesis that the average fare for trips with one passenger is the same as the average fare for trips with multiple passengers.

While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degradation and out-of-memory errors)
Suggested change:
While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degradation and out-of-memory errors).
Looking forward, we'd also like to enable `dask.array` support via the Array API in other Array API
compatible libraries, most notably scikit-learn. A previous
[attempt](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled
Suggested change:
[attempt (scikit-learn PR#28588)](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled