
Conversation

@psobolewskiPhD (Member) commented May 16, 2025

Currently the plugin uses the tifffile zarr backend to read multiscale tiff-variants, which include ndpi, svs, and other whole-slide image formats. (Note: this is only supported with zarr<3.)

In this PR, I wrap the arrays with dask (see the sketch at the end of this comment), which has two primary advantages:

  1. it allows re-chunking via chunks='1 MiB' (a ~1 MiB chunk target), which is a major performance boost for ndpi images from Hamamatsu slide scanners, whose native chunks are (8, 3840, 3). You can download some samples at:
    https://openslide.cs.cmu.edu/download/openslide-testdata/Hamamatsu/
    (I've also verified this with scans produced by JAX microscopy core.)
    For an svs (native 256, 256, 3), chunks='1 MiB' yields 512, 512, 3 chunks, which are nice and performant.
    The ndpis end up as either chunksize=(32, 8192, 3) or chunksize=(24, 11520, 3) and are also much improved.
    Nota bene: dask by default aims for 128 MiB chunks, which is not great for visualization, hence specifying 1 MiB -- we may consider making this configurable in napari (Preferences: consider adding dask config array.chunk-size alongside the cache size, napari#7924).
  2. by using dask, we can take advantage of caching, which increases memory usage but also contributes to improved performance.

Note: With this PR, for the big ndpi I get performance closer to QuPath, though still slightly worse. There is a stutter at the resolution-level transitions (without ASYNC), and with ASYNC it takes a split second to load the level 0.
If I manually set the chunks, e.g. chunks=(2048, 2048, 3) or chunks=(1024, 1024, 3), then performance is a bit better for the ndpi and, with ASYNC, pretty much identical to QuPath.
But I think the auto-chunks are fine for the general use case.
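
A minimal sketch of the wrapping described above, assuming the tifffile zarr store is opened roughly like this (variable names `store` and `group` are my own; the plugin's actual reader code may differ):

```python
import dask.array as da
import tifffile
import zarr

# Open the whole-slide TIFF as a multiscale zarr group via tifffile's zarr store
store = tifffile.imread("CMU-1.ndpi", aszarr=True)
group = zarr.open(store, mode="r")

# Re-chunk each pyramid level to ~1 MiB dask chunks instead of the tiny native tiles
data = [da.from_zarr(arr, chunks="1 MiB") for _, arr in group.arrays()]
```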

@psobolewskiPhD (Member Author)

I do find it interesting that chunks (1024, 1024) is so good when the native chunks are (8, 3840).
We could use group[0].chunks to get the native chunk size and then use that to set the dask chunk size (rough sketch below).
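
Something like this hypothetical sketch, assuming 2D RGB pyramid levels (the scaling heuristic and names here are mine, not what this PR does):

```python
import math
import dask.array as da

target = 1 * 2**20                           # aim for ~1 MiB per chunk
native = group["0"].chunks                   # e.g. (8, 3840, 3) for an ndpi
native_bytes = math.prod(native) * group["0"].dtype.itemsize

# Scale the two spatial dimensions by the same integer factor until ~target bytes
factor = max(1, round((target / native_bytes) ** 0.5))
chunks = (native[0] * factor, native[1] * factor, native[2])  # (24, 11520, 3) for the ndpi above

data = [da.from_zarr(arr, chunks=chunks) for _, arr in group.arrays()]
```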

cc: @jni

Also, I'm not sure whether we want to release the important fix (#43) before or after this PR?

@psobolewskiPhD (Member Author)

Should the plugin just reduce the dask setting and let it auto-choose? This still isn't great, because it keeps a super-wide chunk shape, much wider than any display, e.g.

dask.config.get('array.chunk-size')
Out[4]: '6 MiB'

viewer.layers[0].data[0]
Out[5]: dask.array<from-zarr, shape=(101376, 188160, 3), dtype=uint8, chunksize=(64, 30720, 3), chunktype=numpy.ndarray>

This is better than using native chunks, but still not great.
It feels like using multiples of the native chunk size does make sense, but I'm not sure how to make it universal if someone has a 4D multiscale image or something.
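
For reference, a sketch of that alternative: lowering dask's global target and letting auto-chunking pick the shape (whether the plugin should touch global config at all is exactly the open question; `group` is the multiscale zarr group as above):

```python
import dask
import dask.array as da

# Temporarily lower dask's target chunk size (default is 128 MiB) and let 'auto' decide
with dask.config.set({"array.chunk-size": "1MiB"}):
    data = [da.from_zarr(arr, chunks="auto") for _, arr in group.arrays()]
```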

@psobolewskiPhD (Member Author)

After some discussion on zulip: https://napari.zulipchat.com/#narrow/channel/212875-general/topic/lazy.20reading.20tiff-based.20WSI/near/518694288
I am leaning towards using:
data = [da.from_zarr(arr, chunks='1 MiB') for _, arr in group.arrays()]
which seems OK. The svs (native 256, 256, 3) ends up as 512, 512, 3 and is nice and performant. The ndpis end up as either chunksize=(32, 8192, 3) or chunksize=(24, 11520, 3), which are both fine and much better than the 128 MiB chunks from auto. I'm not firm on 1 MiB vs. say 2 MiB or even 3 MiB.

@psobolewskiPhD (Member Author)

For CMU-1
ndpi from: https://openslide.cs.cmu.edu/download/openslide-testdata/Hamamatsu/
svs from: https://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/

With data = [da.from_zarr(arr, chunks='1 MiB') for _, arr in group.arrays()] and ASYNC on:
performance:
SVS QuPath ~ SVS napari > NDPI QuPath > NDPI napari
but the difference for the ndpi isn't huge. It's fine honestly; it takes a split second to snap in the level 0 resolution when you zoom in.

@jni (Member) commented May 17, 2025

One issue is that dask array performance to get a single chunk out is quite bad when the overall task graph has many (say, millions of) chunks. AFAIK that issue remains unfixed. So I would probably err on the side of slightly bigger chunks, say 4 MiB. How does that work for you performance-wise @psobolewskiPhD? Otherwise I think it's a good idea to merge and release this quickly.

@psobolewskiPhD (Member Author)

I'll try to test ASAP, but I don't have a good idea for a real test other than eyeballing.
For sure 2 MiB is fine. With 4 MiB I suspect the ndpi will start to suffer because the shape will be more and more pathological--unless you want to only pan side to side 🤣
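
A rough micro-benchmark one could use instead of pure eyeballing (a sketch under my own assumptions about what to time, not a test that was actually run here; `group` is the multiscale zarr group as above):

```python
import time
import numpy as np
import dask.array as da

# Time pulling roughly one screen's worth of level-0 pixels for a few chunk targets
for chunks in ("1 MiB", "2 MiB", "4 MiB"):
    arr = da.from_zarr(group["0"], chunks=chunks)
    t0 = time.perf_counter()
    np.asarray(arr[:2048, :2048])
    print(chunks, f"{time.perf_counter() - t0:.2f} s")
```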

@jni (Member) commented May 17, 2025

I like your eyeball tests! You have a good eye! 😊

@jni (Member) commented May 17, 2025

(at any rate, I'm happy for you to self-merge as-is to get you unblocked. We can always iterate!)

@jni (Member) commented May 17, 2025

(and you could turn my comment into an issue. 😊)

@psobolewskiPhD (Member Author)

I only have the smaller sample svs and ndpi.
With 3 MiB the SVS is (1024, 1024, 3), which is fine obviously. But the ndpi is (64, 16384, 3). It's still OK, but you start to notice it a bit more in terms of the time to snap in the level 0 when zooming with ASYNC.

4 MiB
SVS: (1170, 1024, 3) perfectly performant.
NDPI: (72, 18432, 3) similar to the 3 MiB? Not sure we can really make this one much better without hard-coding something.

The question is how likely a 3D multiscale is, because my understanding is that for 3D you want smaller chunks.

The full-size ones are ~60 GiB uncompressed, so at 1 MiB that's ~60K chunks; at 3 MiB it would be ~20K chunks...
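
Back-of-the-envelope for those chunk counts (the ~60 GiB figure is the estimate above):

```python
uncompressed = 60 * 2**30                    # ~60 GiB level-0 array
for mib in (1, 3):
    n_chunks = uncompressed // (mib * 2**20)
    print(f"{mib} MiB chunks -> ~{n_chunks:,} chunks")
# 1 MiB chunks -> ~61,440 chunks
# 3 MiB chunks -> ~20,480 chunks
```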

@TimMonko (Contributor)

One issue is that dask array performance to get a single chunk out is quite bad when the overall task graph has many (say millions) of chunks.

Anything above 200-ish GB I start to consider 'bigger than we can expect to work smoothly out of the box', so even with any of the chunk sizes mentioned it would still be at most 200k chunks (with 1 MiB), so I think whatever y'all deem reasonable would find me content.

The question is how likely a 3D multiscale is, because my understanding is that for 3D you want smaller chunks.

I think this is the direction light sheet is going -- but I have very little experience.

@psobolewskiPhD (Member Author)

I think this is the direction light sheet is going -- but I have very little experience.

I would assume light sheets are not using TIFF variants, but something like zarr or n5?

I'm tempted to leave this at 1 MiB for now, which was recommended by multiple people much smarter than me. And we can always adjust it upwards if the need arises. This might all become moot if we want to move to a zarr v3 world.

@jni (Member) commented May 17, 2025

Agreed! Thanks for all your work on this @psobolewskiPhD ! I'm on phone but I'll try to remember to merge and release in the morning.

psobolewskiPhD marked this pull request as draft May 24, 2025 14:28
@psobolewskiPhD (Member Author)

Making this a draft until after #48 is merged.
Then we can reconsider this PR in light of zarr3/dask issues.

@psobolewskiPhD (Member Author) commented May 29, 2025

Ok, we're going to need this.
I finally had a chance to test with some real ndpi that are ~7 GB on disk and ~70 GB uncompressed,
like Hamamatsu-1.ndpi (from https://openslide.cs.cmu.edu/download/openslide-testdata/Hamamatsu/).
I can replicate the performance regression reported in cgohlke/tifffile#297 (comment) above using the updated napari-tiff.
And with zarr3, visualizing in napari is brutally bad compared to zarr2: at least 5x worse (4-6 second delays vs. mere choppiness).

  • py313, zarr3, no async: brutal; even panning level 5 (lowest) or a minor zoom blocks for multiple seconds, and double-click zoom blocks for a few seconds
    • async: very slow to update after a pan on level 5, 5+ seconds
      • zoom to level 1 from 5: ~4 s
      • home from a crop of level 5: multiple seconds
  • py313, zarr2, no async: very slow but not blocking on level 5; home is fast; double-click to zoom is OK; scroll is choppy, but not horrendous; home from level 0 is fine
    • async: pan on level 5 takes maybe 1 s to refresh? zoom to level 1 is fine, small delay

It's probably at least partially related to:

viewer.layers[0].data[5].shape
Out[1]: (3168, 5880, 3)

viewer.layers[0].data[5].chunks
Out[3]: (8, 120, 3)

:freeze:

Wrapping the arrays with dask.array.from_zarr() with chunks='1 MiB', as in this PR, gives:

viewer.layers[0].data[5]
Out[1]: dask.array<from-zarr, shape=(3168, 5880, 3), dtype=uint8, chunksize=(152, 2280, 3), chunktype=numpy.ndarray>

And now it's pretty alright with zarr3. Still worse than the equivalent with zarr2, but the user experience is passable for sure.
I will update the PR, as I don't see another avenue right now.

@ianhi commented Jun 6, 2025

@psobolewskiPhD I'm trying to work on these zarr 3 performance issues. Just so I understand, the simplest reproducer here is to download a large tiff from openslide and open it with the napari-tiff main branch prior to this PR?

@psobolewskiPhD (Member Author) commented Jun 6, 2025

Hi @ianhi thanks for reaching out!
Yes, with the ndpi from https://openslide.cs.cmu.edu/download/openslide-testdata/Hamamatsu/ being particularly problematic, as noted in my testing above -- all with release 0.6.1 napari and the latest release of this plugin. It seems to be related to the number of small chunks in the ndpi (their shape isn't ideal in general) and the tifffile zarr store.
Other WSI formats like SVS seem better behaved, with typical chunks of either 256x256 or 512x512 for all data levels.

You may also want to look at the reproducers from cgohlke here:
cgohlke/tifffile#297 (comment)
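
For what it's worth, a minimal standalone reproducer along those lines might look like this (the file name and rgb flag are assumptions; the pre-PR plugin passes the raw zarr arrays to napari roughly like this):

```python
import tifffile
import zarr
import napari

store = tifffile.imread("CMU-1.ndpi", aszarr=True)   # tifffile zarr store for the WSI
group = zarr.open(store, mode="r")                   # multiscale group, one array per level
data = [arr for _, arr in group.arrays()]            # raw zarr arrays, no dask wrapping

viewer = napari.view_image(data, multiscale=True, rgb=True)
napari.run()
```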

@psobolewskiPhD (Member Author) commented Jun 14, 2025

Implementing this is a workaround for:

Of course, merging back doesn't work:

@codecov bot commented Jun 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.00%. Comparing base (ce361e7) to head (cd17ef1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   79.95%   80.00%   +0.04%     
==========================================
  Files           8        8              
  Lines         459      460       +1     
==========================================
+ Hits          367      368       +1     
  Misses         92       92              


psobolewskiPhD marked this pull request as ready for review June 14, 2025 23:15
@psobolewskiPhD (Member Author)

Bumping this. I did some tests with the latest napari, tifffile, zarr, and dask and still get the same behaviors: using dask with 1 MiB chunks is massively more performant for the very large NDPI I have access to, with its "pathological" native chunks. Oddly, in naive tests like np.asarray(data[5]), zarr was faster than dask, but in napari the difference is starkly the other way around. I assume it has to do with the way slicing is implemented and leverages dask.
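
The naive test I mean is roughly this (a sketch; level 5 and the 1 MiB wrapper are just the examples from above, with `group` being the multiscale zarr group):

```python
import numpy as np
import dask.array as da

zarr_level = group["5"]
dask_level = da.from_zarr(zarr_level, chunks="1 MiB")

a = np.asarray(zarr_level)   # read straight from the zarr array
b = np.asarray(dask_level)   # read through the re-chunked dask graph
assert np.array_equal(a, b)  # same pixels; only the access pattern differs
```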
