Parse zarr v2 #822

neilSchroeder · 2025-10-31T19:43:38Z

Checklist

Closes Virtualize Native Zarr V2 format #565
Tests added
Tests passing
Full type hint coverage
Changes are documented in docs/releases.md
New functionality has documentation

How `zarr.py` Handles Zarr V2 Stores

ZarrParser should now support both Zarr V2 and V3 stores by normalizing V2 stores to appear as V3. This approach ensures that all parsers produce V3-compatible outputs, and confines modifications to zarr.py.

V2 → V3 Normalization Strategy

The parser performs a two-part normalization:

1. Chunk Key Mapping (`get_chunk_mapping_prefix`)

For V2 arrays:

- Chunk files are stored directly under the array path, e.g. `array_name/0` or `array_name/0.1.2`.
- Metadata files (`.zarray`, `.zattrs`, etc.) are filtered out.
- Chunk coordinates are normalized to dot-separated format, e.g. `"0.1.2"`.
- File paths in the manifest point to the actual V2 chunk locations.
- Manifest keys contain only chunk coordinates (no path structure).
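
As a rough illustration of the key mapping described above, here is a minimal pure-Python sketch. It is not the code in `zarr.py`: the helper name `map_v2_chunk_keys`, its parameters, and the exact set of metadata file names are assumptions made for this example.

```python
# Hypothetical helper for illustration only; not the function in zarr.py.
V2_METADATA_FILES = frozenset({".zarray", ".zattrs", ".zgroup", ".zmetadata"})


def map_v2_chunk_keys(array_path, store_keys, dimension_separator="."):
    """Map normalized chunk coordinates ("0.1.2") to the original V2 store keys."""
    prefix = array_path.rstrip("/") + "/"
    mapping = {}
    for key in store_keys:
        if not key.startswith(prefix):
            continue  # key belongs to a different array
        relative = key[len(prefix):]
        if relative.rsplit("/", 1)[-1] in V2_METADATA_FILES:
            continue  # skip .zarray, .zattrs, ...
        coords = relative.split(dimension_separator)
        if not all(part.isdigit() for part in coords):
            continue  # anything that is not purely numeric is not a chunk file
        # Manifest key: coordinates only. Value: the real V2 chunk location.
        mapping[".".join(coords)] = key
    return mapping


keys = ["temp/.zarray", "temp/.zattrs", "temp/0.0", "temp/0.1", "temp/1.0"]
chunk_mapping = map_v2_chunk_keys("temp", keys)
```

Only keys that actually exist in the store end up in the mapping, so chunks that were never written are simply absent.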

2. Metadata Conversion (`get_metadata()`)

After converting V2 metadata to V3 using _convert_array_metadata, we have to replace the chunk_key_encoding.

- The automatic converter preserves `V2ChunkKeyEncoding` in the V3 metadata.
- When zarr/xarray sees `V2ChunkKeyEncoding`, it requests chunks using V2-style paths: `array/0`.
- With `DefaultChunkKeyEncoding`, zarr requests chunks using V3-style paths: `array/c/0`.
- `ManifestStore.get()` expects V3-style paths and uses `parse_manifest_index()` to extract chunk coordinates.
- `parse_manifest_index()` requires the `/c/` component to correctly parse the path.
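
To make the difference between the two key styles concrete, here is a small standalone sketch. The three functions are hypothetical stand-ins written for this example; in particular `parse_v3_chunk_path` only imitates the behaviour described above and is not the real `parse_manifest_index()`.

```python
import re


def v2_chunk_key(coords, separator="."):
    """V2-style chunk key, e.g. (0, 1, 2) -> "0.1.2"."""
    return separator.join(str(c) for c in coords) if coords else "0"


def v3_default_chunk_key(coords):
    """V3 default chunk key, e.g. (0, 1, 2) -> "c/0/1/2"."""
    return "/".join(["c", *(str(c) for c in coords)])


# Assumed pattern: <array path>/c/<i>/<j>/...
_V3_CHUNK_PATH = re.compile(r"^(?P<array>.+?)/c/(?P<coords>\d+(?:/\d+)*)$")


def parse_v3_chunk_path(path):
    """Split a V3-style chunk path into (array path, chunk coordinates)."""
    match = _V3_CHUNK_PATH.match(path)
    if match is None:
        raise ValueError(f"not a V3-style chunk path: {path!r}")
    coords = tuple(int(part) for part in match["coords"].split("/"))
    return match["array"], coords


v3_path = "array/" + v3_default_chunk_key((0, 1, 2))  # "array/c/0/1/2"
v2_path = "array/" + v2_chunk_key((0, 1, 2))          # "array/0.1.2"
parsed = parse_v3_chunk_path(v3_path)
```

Passing `v2_path` to `parse_v3_chunk_path` raises `ValueError`, which is the kind of mismatch that swapping in the default encoding avoids.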

Additional metadata handling

- None fill values: Converted to appropriate dtype defaults.
- Dimension names: Extracted from the `_ARRAY_DIMENSIONS` attribute or generated as `{array_name}_dim_{i}`.
- All other metadata: Converted using zarr's standard V2→V3 migration utilities.
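
A rough sketch of the first two rules, for illustration. Both helper names are hypothetical, and the zero-like defaults in `_ZERO_BY_KIND` are an assumption about what "appropriate dtype defaults" means; the real implementation may choose differently and also has to cope with dtypes this sketch ignores (e.g. structured dtypes).

```python
# Assumed zero-like defaults keyed by NumPy dtype kind character.
_ZERO_BY_KIND = {"b": False, "i": 0, "u": 0, "f": 0.0, "c": 0j, "S": b"", "U": ""}


def default_fill_value(v2_dtype, fill_value):
    """Return fill_value, or an assumed zero-like default when it is None."""
    if fill_value is not None:
        return fill_value
    kind = v2_dtype.lstrip("<>|=")[0]  # "<f8" -> "f", "|b1" -> "b"
    try:
        return _ZERO_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"no default fill value known for dtype {v2_dtype!r}") from None


def resolve_dimension_names(array_name, attrs, ndim):
    """Use _ARRAY_DIMENSIONS when present, otherwise generate placeholder names."""
    dims = attrs.get("_ARRAY_DIMENSIONS")
    if dims is not None:
        return tuple(dims)
    return tuple(f"{array_name}_dim_{i}" for i in range(ndim))


named = resolve_dimension_names("temp", {"_ARRAY_DIMENSIONS": ["time", "lat"]}, 2)
generated = resolve_dimension_names("temp", {}, 2)
```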

Implementation Notes

I'm not convinced I've done a particularly elegant implementation here, but adding another class for V2 parsing didn't seem like it would be particularly extensible. Very happy to hear thoughts on perhaps a better implementation.

@TomNicholas thank you very much for your feedback, it definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and documentation are also up to date.
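
For readers unfamiliar with the pattern, this is roughly the shape of a strategy-based dispatch on the zarr format version. The class names, the single `metadata_key` method, and `get_strategy` are all hypothetical and chosen to keep the example short; the actual strategies in the PR cover more than this.

```python
from abc import ABC, abstractmethod


class ZarrVersionStrategy(ABC):
    """Hypothetical base class: one subclass per zarr format version."""

    @abstractmethod
    def metadata_key(self, array_path: str) -> str:
        """Return the store key that holds the array metadata."""


class ZarrV2Strategy(ZarrVersionStrategy):
    def metadata_key(self, array_path: str) -> str:
        return f"{array_path}/.zarray"  # V2 keeps array metadata in .zarray


class ZarrV3Strategy(ZarrVersionStrategy):
    def metadata_key(self, array_path: str) -> str:
        return f"{array_path}/zarr.json"  # V3 keeps array metadata in zarr.json


_STRATEGIES = {2: ZarrV2Strategy, 3: ZarrV3Strategy}


def get_strategy(zarr_format: int) -> ZarrVersionStrategy:
    """Pick the strategy for a format version, failing loudly on unknown ones."""
    try:
        return _STRATEGIES[zarr_format]()
    except KeyError:
        raise NotImplementedError(f"unsupported zarr_format: {zarr_format!r}") from None
```

Supporting another format version would then mean adding one subclass and one entry in `_STRATEGIES`, without touching the calling code.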


TomNicholas · 2025-10-31T20:18:03Z

virtualizarr/manifests/store.py

Hey @neilSchroeder ! This is awesome that you're working on this!

However, I don't think you should need to make any changes to this file in order to add this feature.

IIUC you're altering the key-parsing logic inside ManifestStore to accommodate V2-like keys. But that's not the right place for that logic. The ManifestStore is explicitly a V3 store - its key logic should not need to be changed. Instead, you need to alter the ZarrParser (or add a separate ZarrV2Parser) to do that mapping. The output of the Parser should not have any trace of what file format was parsed, zarr or otherwise. So you might need some v2-key-handling logic like this inside your parser, but not in manifests/store.py. Does that make sense? Am I understanding properly?

Changing this logic in the ManifestStore is also what's causing other, unrelated, tests to fail.

Yeah this totally makes sense. Thanks for the feedback! I wasn't exactly sure where some of these fixes needed to go so I took a pretty naive test-based-development approach.

I'll think about this a little more carefully and see if I can't come up with a more elegant solution that avoids manipulating the stores.

Great - shout if you have any questions / aren't sure about anything!

neilSchroeder · 2025-10-31T22:59:52Z

@TomNicholas thank you very much for your feedback up there, definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and such are also up to date. Also going to move this into the PR notes instead of huge comment here.


TomNicholas · 2025-11-03T22:19:49Z

Let me know when you would like a review of this @neilSchroeder !

neilSchroeder · 2025-11-03T22:23:11Z

@TomNicholas I think it's ready for a review.

TomNicholas

Thanks for working on this @neilSchroeder ! I mostly have a bunch of small gripes 😁

TomNicholas · 2025-11-04T16:32:55Z

virtualizarr/parsers/zarr.py

+        # List all keys under the array prefix, filtering out metadata files
+        prefix_keys = [(x,) async for x in zarr_array.store.list_prefix(prefix)]
+        if not prefix_keys:
+            return {}


There's an important subtlety here that's worth leaving a comment about: The reason we don't just generate the names of the keys from the chunk grid is because zarr chunks are allowed to be missing, and they are allowed to be missing in our VZ manifests too. It might actually be worth adding a test for this case - the case that there are some chunks but some are missing - VZ should return the fill_value for any missing chunks.
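
To illustrate the point about sparse arrays, here is a small standalone sketch. The helper names and the dot-separated manifest keys are assumptions for this example. It compares the keys a full chunk grid would imply with the keys actually listed in the store; whatever is missing should read back as the fill value.

```python
import itertools
import math


def expected_chunk_keys(shape, chunks):
    """All chunk keys a fully populated grid would contain, e.g. "0.1"."""
    grid = [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    return {
        ".".join(str(i) for i in index)
        for index in itertools.product(*(range(n) for n in grid))
    }


def missing_chunk_keys(shape, chunks, listed_keys):
    """Chunk keys implied by the grid but absent from the store listing."""
    return expected_chunk_keys(shape, chunks) - set(listed_keys)


# A 4x4 array with 2x2 chunks where only two of the four chunks were written.
listed = {"0.0", "1.1"}
missing = missing_chunk_keys((4, 4), (2, 2), listed)
```

Generating keys from the grid would wrongly add `"0.1"` and `"1.0"` to the manifest here, which is why listing the store is the safer source of truth.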

I think I've added a test that covers what you're getting at, but it will need some review to make sure I've understood correctly. As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?

> I think I've added a test that covers what you're getting at,

I see there are tests for the case that there are no chunks at all in the array, but I don't see a test for the case that there are some chunks but some are missing. Maybe that distinction is not important though.

> As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?

I meant add a comment to the file.

I just added the test and haven't pushed it yet. Sorry.

And I'll add a comment to the file.


codecov · 2025-11-06T22:01:12Z

Codecov Report

❌ Patch coverage is 99.13793% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.31%. Comparing base (cb2912e) to head (a4a271f).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| virtualizarr/parsers/zarr.py | 99.13% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #822      +/-   ##
==========================================
+ Coverage   87.71%   88.31%   +0.60%     
==========================================
  Files          35       35              
  Lines        1880     1968      +88     
==========================================
+ Hits         1649     1738      +89     
+ Misses        231      230       -1

| Files with missing lines | Coverage Δ | |
|---|---|---|
| virtualizarr/parsers/zarr.py | `99.33% <99.13%> (+2.55%)` | ⬆️ |


neilSchroeder added 8 commits October 31, 2025 13:54

Implement Zarr V2 to V3 metadata conversion with fill value handling

7027e0d

Enhance parse_manifest_index to support V2 and V3 chunk key parsing

0cb9ab3

Refactor parse_manifest_index to improve regex pattern matching and error handling

f43a822

Enhance get_chunk_mapping_prefix to support V2 and V3 chunk path parsing

a5dec74

Enhance build_chunk_manifest to calculate chunk grid shape for inline V2 data

5ebb020

Enhance test_virtual_dataset_zarr to handle dimension name checks for V2 and V3 formats

08f7dd3

Remove redundant check for V2 format in get_metadata function

126db99

cleaning up

bbb5980

neilSchroeder temporarily deployed to test-release October 31, 2025 19:44 — with GitHub Actions Inactive

[pre-commit.ci] auto fixes from pre-commit.com hooks

e4f3019

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release October 31, 2025 19:45 Inactive

TomNicholas reviewed Oct 31, 2025


neilSchroeder added 3 commits October 31, 2025 15:38

revert store

8d8386b

linting

ce42f47

merge and lint

a7baf03

neilSchroeder temporarily deployed to test-release October 31, 2025 22:21 — with GitHub Actions Inactive

fixing mypy typing

f27c866

neilSchroeder temporarily deployed to test-release October 31, 2025 22:28 — with GitHub Actions Inactive

removing redundant code, linting

f7c8434

neilSchroeder temporarily deployed to test-release October 31, 2025 23:26 — with GitHub Actions Inactive

neilSchroeder added 2 commits November 3, 2025 09:21

refactor zarr parsing to use strategy pattern for extensibility and maintainability, linted

3d2d705

refactor test, add tests to improve coverage of zarr parsing (97%), linted

aa9bbe0

neilSchroeder temporarily deployed to test-release November 3, 2025 16:25 — with GitHub Actions Inactive

neilSchroeder added 2 commits November 3, 2025 14:35

adding v2 parsing as new feature

e6cabaf

updating ZarrParser documentation

90c621f

neilSchroeder temporarily deployed to test-release November 3, 2025 22:18 — with GitHub Actions Inactive

neilSchroeder marked this pull request as ready for review November 3, 2025 22:22

TomNicholas requested changes Nov 4, 2025


neilSchroeder added 2 commits November 5, 2025 10:39

converting protocol to ABC

96bd1d4

adding tests for sparse files being filled with default fill values

3e39f12

neilSchroeder temporarily deployed to test-release November 6, 2025 19:24 — with GitHub Actions Inactive

fix zeros list

49cb3a4

neilSchroeder temporarily deployed to test-release November 6, 2025 19:24 — with GitHub Actions Inactive

neilSchroeder added 3 commits November 6, 2025 12:29

adding comment about chunk key discovery

a16f595

cleaning up a bit based on comments

8103aab

fixing issue with conflicting test assertions around v2 metadata attributes

a4a271f

neilSchroeder temporarily deployed to test-release November 6, 2025 21:58 — with GitHub Actions Inactive

refactoring common bits of code

3b9ce88

neilSchroeder temporarily deployed to test-release November 6, 2025 22:04 — with GitHub Actions Inactive

neilSchroeder added 2 commits November 6, 2025 15:11

raise error on shard detection for v3

3348604

test that sharded v3 array raises error

8745e9f

neilSchroeder temporarily deployed to test-release November 6, 2025 22:12 — with GitHub Actions Inactive

fixing mypy errors

75f439e

neilSchroeder deployed to test-release November 6, 2025 22:19 — with GitHub Actions View deployment

neilSchroeder requested a review from TomNicholas November 6, 2025 23:27
