[BLOG]: Supporting dask arrays in scipy via the Python Array API standard #904

Open: @lithomas1 wants to merge 6 commits into base: main

@lithomas1 (Contributor) commented Feb 4, 2025

Text styling

  • The blog is written with plain language (where relevant).
  • If there are headers, they use the proper header tags in order to do so (with only one level-one header).
  • All links describe where they link to (for example, check the Quansight labs website).
  • Any kind of styling that the author uses (for example, bold for emphasis) is consistent throughout the blog.

Non-text contents

  • Blog post featured image is in PNG or JPEG format, not SVG.
  • All content is represented as text (for example, images need alt text and videos need captions or descriptive transcripts).
  • If there are emojis, there are not more than three in a row.
  • Don't use flashing gifs or videos.
  • If it were to be read as plain text, the blog still makes sense and no information is missing.

vercel bot commented Feb 4, 2025

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| labs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 6, 2025 0:31am |

@lithomas1 changed the title from "[BLOG]: Supporting dask arrays in scipy via the Python Array API stan…" to "[BLOG]: Supporting dask arrays in scipy via the Python Array API standard" on Feb 4, 2025
@lithomas1 (Contributor, Author)

Not done yet, but just wanted to put up a quick draft of my blog post.

The main omission is probably the case study section where I port a scipy.stats workflow to dask arrays and look at performance.

All other sections are basically complete.

@rgommers (Member)

This looks like a good start @lithomas1, thanks. The flow of the story at a high level looks good to me. The case study section is indeed important; the post is still very much a draft, so I didn't review it in detail.

@rgommers added the "labs 🔭" (Items related to the Labs website) and "type: content 📝" labels on May 4, 2025
@lithomas1 marked this pull request as ready for review on June 6, 2025 00:18
@lithomas1 (Contributor, Author)

@rgommers This should be ready for a look now.

@pavithraes (Member) left a comment

@lithomas1 Thank you! This is a good read!

I've shared some suggestions, mainly around phrasing :)

---
title: 'Supporting dask arrays in scipy via the Python Array API standard'
authors: [thomas-li]
published: May 26, 2025
Member

Just noting that we'll need to update this before merging. :)

Comment on lines +15 to +16
In this post, I describe my journey getting SciPy to work with Dask arrays natively via the array API and the current
limitations and future outlook.
Member

Suggested change
- In this post, I describe my journey getting SciPy to work with Dask arrays natively via the array API and the current
- limitations and future outlook.
+ In this post, I describe my journey getting SciPy to work with Dask Arrays natively via the Array API standard,
+ and discuss the current limitations and future outlook of this work.


## Introduction: A quick refresher of the Python Array API standard

For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
Member

Suggested change
- For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
+ The [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),

I think it'll be nicer to start with a simpler statement.

## Introduction: A quick refresher of the Python Array API standard

For those unfamiliar, the [Python Array API standard](https://data-apis.org/array-api/latest/API_specification/),
is a specification aimed at unifying the various APIs of different array libraries (e.g. Numpy, PyTorch, JAX, Dask, etc.).
Member

Suggested change
- is a specification aimed at unifying the various APIs of different array libraries (e.g. Numpy, PyTorch, JAX, Dask, etc.).
+ is a specification aimed at unifying the various APIs of different array libraries (e.g. NumPy, PyTorch, JAX, Dask, etc.).

I see there are several different capitalizations for various libraries through the blog. Could you please do a find and replace to have one style for all?

I think these are the capitalizations: NumPy, SciPy, Dask, PyTorch, pandas, JAX, and CuPy, unless you're referring to the API, in which case it's all lowercase and presented as inline code like `dask.array`. I think this is already the case for the most part, but I noticed a few deviations here and there, hence the explicit comment. :)

users to treat arbitrary array objects as numpy arrays via duck typing.

Today, [array api support](https://scipy.github.io/devdocs/dev/api-dev/array_api.html) in scipy has progressed a long
way since mid 2023 when array API support was first experimentally introduced within the libary. While the array API
Member

Suggested change
- way since mid 2023 when array API support was first experimentally introduced within the libary. While the array API
+ way since mid-2023 when array API support was first experimentally introduced within the library. While the array API


`*` - Some public API functions/methods in this module have not yet been ported to the Array API standard.
(Status refers to the status of dask.array with )
See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
Member

Suggested change
- See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
+ See the [SciPy developer docs](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)

See [here](https://scipy.github.io/devdocs/dev/api-dev/array_api.html#currently-supported-functionality)
for a list of supported functions/methods.

As of today, the `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to
Member

Suggested change
- As of today, the `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to
+ The `scipy.fft/special/stats` modules have the best support for dask arrays today, and are able to

In the next section, we will take a look more closely at how array API compatibility enables better performance with
dask arrays within the `scipy.stats` module.

## Example
Member

Really like this!


From this p-value, we can reject our null hypothesis that the average fare for trips with one passenger is the same as the average fare for trips with multiple passengers.

While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degredation and out-of-memory errors)
Member

Suggested change
- While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degredation and out-of-memory errors)
+ While we weren't entirely able to avoid computation in the middle (dask still struggles with unknown shapes which we get through our boolean masking on the dataframe), we were able to entirely keep the computation in dask. This is a big improvement over the pre-Array API behavior where the input dask arrays would be cast to numpy arrays (forcing computation and storage of intermediate results in one worker which can lead to performance degradation and out-of-memory errors).
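
For readers of this thread, the contrast in that paragraph (coercing the input versus dispatching through the array's own namespace) can be sketched with a toy example. This is only an illustration under stated assumptions: `LazyArray`, `lazy_namespace`, `mean_with_coercion`, and `mean_with_namespace` are hypothetical names, not dask or SciPy code. The only piece taken from the Array API standard is the `__array_namespace__` method, which returns the namespace of functions for that array type.

```python
from types import SimpleNamespace


class LazyArray:
    """Toy stand-in for a chunked, lazy array (hypothetical; not dask)."""

    def __init__(self, chunks):
        self.chunks = [list(chunk) for chunk in chunks]
        # Flips to True if all data is ever gathered in one place.
        self.materialized = False

    def to_list(self):
        # Analogous to casting to a NumPy array: every chunk is pulled
        # into a single in-memory container.
        self.materialized = True
        return [value for chunk in self.chunks for value in chunk]

    def __array_namespace__(self, api_version=None):
        # Hook defined by the Array API standard: return the namespace
        # that knows how to operate on this array type.
        return lazy_namespace


def _lazy_mean(x):
    # Reduce each chunk on its own, then combine the small partial results.
    partials = [(sum(chunk), len(chunk)) for chunk in x.chunks]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical namespace exposing only `mean`; real namespaces are much larger.
lazy_namespace = SimpleNamespace(mean=_lazy_mean)


def mean_with_coercion(x):
    # Pre-Array-API style: convert the input first, then compute.
    data = x.to_list()
    return sum(data) / len(data)


def mean_with_namespace(x):
    # Array-API style: ask the array for its namespace and stay inside it.
    xp = x.__array_namespace__()
    return xp.mean(x)


coerced = LazyArray([[1, 2], [3, 4, 5]])
dispatched = LazyArray([[1, 2], [3, 4, 5]])
print(mean_with_coercion(coerced), coerced.materialized)
print(mean_with_namespace(dispatched), dispatched.materialized)
```

Both functions return the same value, but only the coercing version gathers every chunk in one place, which is the behavior the paragraph describes as leading to slowdowns and out-of-memory errors.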


Looking forward, we'd also like to enable `dask.array` support via the Array API in other Array API
compatible libraries, most notably scikit-learn. A previous
[attempt](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled
Member

Suggested change
- [attempt](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled
+ [attempt (scikit-learn PR#28588)](https://github.com/scikit-learn/scikit-learn/pull/28588) to add array API support within scikit-learn stalled

3 participants