@neilSchroeder neilSchroeder commented Oct 31, 2025

How zarr.py Handles Zarr V2 Stores

ZarrParser should now support both Zarr V2 and V3 stores by normalizing V2 stores to appear as V3. This approach ensures that all parsers produce V3-compatible outputs, and confines modifications to zarr.py.

V2 → V3 Normalization Strategy

The parser performs a two-part normalization:

1. Chunk Key Mapping (get_chunk_mapping_prefix)

For V2 arrays:

  • Chunk files are stored directly under the array path: array_name/0, array_name/0.1.2
  • Metadata files (.zarray, .zattrs, etc.) are filtered out
  • Chunk coordinates are normalized to dot-separated format: "0.1.2"
  • File paths in the manifest point to the actual V2 chunk locations
  • Manifest keys contain only chunk coordinates (no path structure)
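
As a rough illustration of the mapping described above, here is a minimal sketch in plain Python. It is not the code in zarr.py: the function name `v2_chunk_manifest_keys`, the `V2_METADATA_FILES` set, and the `dimension_separator` parameter are hypothetical names chosen for this example, and the store is stood in for by a plain list of keys.

```python
# Hypothetical helper, not the actual implementation in zarr.py.
V2_METADATA_FILES = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}


def v2_chunk_manifest_keys(store_keys, array_path, dimension_separator="."):
    """Map V2 store keys under ``array_path`` to {chunk_coords: store_key}.

    Manifest keys are dot-separated chunk coordinates only; the values keep
    pointing at the original V2 chunk locations.
    """
    prefix = array_path.rstrip("/") + "/"
    mapping = {}
    for key in store_keys:
        if not key.startswith(prefix):
            continue  # belongs to a different array
        relative = key[len(prefix):]
        if relative.split("/")[-1] in V2_METADATA_FILES:
            continue  # metadata file, not a chunk
        coords = relative.split(dimension_separator)
        if not all(part.isdigit() for part in coords):
            continue  # anything else that is not a chunk key
        mapping[".".join(coords)] = key
    return mapping


keys = ["air/.zarray", "air/.zattrs", "air/0.0", "air/0.1", "other/0.0"]
manifest_keys = v2_chunk_manifest_keys(keys, "air")
```

The same helper also normalizes stores written with a `/` separator, since the coordinates are re-joined with dots regardless of how they were split.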

2. Metadata Conversion (get_metadata())

After converting V2 metadata to V3 using _convert_array_metadata, we have to replace the chunk_key_encoding.

  • The automatic converter preserves V2ChunkKeyEncoding in the V3 metadata
  • When zarr/xarray sees V2ChunkKeyEncoding, it requests chunks using V2-style paths: array/0
  • With DefaultChunkKeyEncoding, zarr requests chunks using V3-style paths: array/c/0
  • ManifestStore.get() expects V3-style paths and uses parse_manifest_index() to extract chunk coordinates
  • parse_manifest_index() requires the /c/ component to correctly parse the path
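
To make the difference between the two key styles concrete, here is a small sketch. All three functions are hypothetical stand-ins written for this example; in particular `parse_chunk_coords` is a simplified imitation of a parser that relies on the /c/ component, not the real parse_manifest_index().

```python
def v2_style_key(array_path, coords, separator="."):
    """Key requested when the V2 chunk key encoding is in effect."""
    return f"{array_path}/{separator.join(str(c) for c in coords)}"


def v3_default_key(array_path, coords):
    """Key requested with the default V3 encoding ('c' prefix, '/' separator)."""
    return "/".join([array_path, "c", *(str(c) for c in coords)])


def parse_chunk_coords(key):
    """Simplified stand-in for a parser that needs '/c/' in the key."""
    _, marker, chunk_part = key.rpartition("/c/")
    if not marker:
        raise ValueError(f"no '/c/' component in key: {key!r}")
    return tuple(int(part) for part in chunk_part.split("/"))


v2_key = v2_style_key("array", (0, 1))    # "array/0.1"
v3_key = v3_default_key("array", (0, 1))  # "array/c/0/1"
coords = parse_chunk_coords(v3_key)       # (0, 1)
```

Passing `v2_key` to `parse_chunk_coords` raises `ValueError`, which is the failure mode that replacing the chunk_key_encoding is meant to avoid.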

Additional metadata handling

  • None fill values: Converted to appropriate dtype defaults
  • Dimension names: Extracted from _ARRAY_DIMENSIONS attribute or generated as {array_name}_dim_{i}
  • All other metadata: Converted using zarr's standard V2→V3 migration utilities
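
A minimal sketch of the first two rules, under stated assumptions: the helper names are hypothetical, and the choice of a zero-like default per dtype kind is an assumption made for this example rather than a statement of what the parser does.

```python
# Assumed zero-like defaults keyed by NumPy dtype kind character.
_ASSUMED_DEFAULTS = {"b": False, "i": 0, "u": 0, "f": 0.0}


def default_fill_value(fill_value, dtype_kind):
    """Return fill_value unchanged, or an assumed default when it is None."""
    if fill_value is not None:
        return fill_value
    try:
        return _ASSUMED_DEFAULTS[dtype_kind]
    except KeyError:
        raise ValueError(f"no assumed default for dtype kind {dtype_kind!r}")


def resolve_dimension_names(array_name, attrs, ndim):
    """Use _ARRAY_DIMENSIONS when present, else generate {array_name}_dim_{i}."""
    names = attrs.get("_ARRAY_DIMENSIONS")
    if names is not None:
        if len(names) != ndim:
            raise ValueError("_ARRAY_DIMENSIONS does not match array rank")
        return tuple(names)
    return tuple(f"{array_name}_dim_{i}" for i in range(ndim))


named = resolve_dimension_names("air", {"_ARRAY_DIMENSIONS": ["time", "lat"]}, 2)
generated = resolve_dimension_names("air", {}, 2)
```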

Implementation Notes

I'm not convinced I've done a particularly elegant implementation here, but adding another class for V2 parsing didn't seem like it would be particularly extensible. Very happy to hear thoughts on perhaps a better implementation.

@TomNicholas thank you very much for your feedback, it definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and documentation are also up to date.
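
For readers unfamiliar with the pattern, the dispatch looks roughly like the sketch below. The class and function names are hypothetical and chosen only for illustration; the actual strategy classes in zarr.py may be organized differently.

```python
from abc import ABC, abstractmethod


class ArrayStrategy(ABC):
    """One strategy per zarr format; the parser only talks to this interface."""

    @abstractmethod
    def chunk_key(self, array_path, coords):
        """Return the store key where the chunk at ``coords`` lives."""


class V2Strategy(ArrayStrategy):
    def chunk_key(self, array_path, coords):
        return f"{array_path}/{'.'.join(str(c) for c in coords)}"


class V3Strategy(ArrayStrategy):
    def chunk_key(self, array_path, coords):
        return "/".join([array_path, "c", *(str(c) for c in coords)])


_STRATEGIES = {2: V2Strategy, 3: V3Strategy}


def strategy_for(zarr_format):
    """Pick a strategy from the array's zarr_format, failing loudly otherwise."""
    try:
        return _STRATEGIES[zarr_format]()
    except KeyError:
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}") from None


key = strategy_for(2).chunk_key("air", (0, 1))  # "air/0.1"
```

Supporting another format in this layout would mean adding one class and one registry entry, without touching the calling code.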

Member

@TomNicholas TomNicholas Oct 31, 2025

Hey @neilSchroeder ! This is awesome that you're working on this!

However, I don't think you should need to make any changes to this file in order to add this feature.

IIUC you're altering the key-parsing logic inside ManifestStore to accommodate V2-like keys. But that's not the right place for that logic. The ManifestStore is explicitly a V3 store - its key logic should not need to be changed. Instead, you need to alter the ZarrParser (or add a separate ZarrV2Parser) to do that mapping. The output of the Parser should not have any trace of what file format was parsed, zarr or otherwise. So you might need some v2-key-handling logic like this inside your parser, but not in manifests/store.py. Does that make sense? Am I understanding properly?

Changing this logic in the ManifestStore is also what's causing other, unrelated, tests to fail.

Author

Yeah this totally makes sense. Thanks for the feedback! I wasn't exactly sure where some of these fixes needed to go so I took a pretty naive test-based-development approach.

I'll think about this a little more carefully and see if I can't come up with a more elegant solution that avoids manipulating the stores.

Member

Great - shout if you have any questions / aren't sure about anything!

@neilSchroeder
Author

neilSchroeder commented Oct 31, 2025

@TomNicholas thank you very much for your feedback up there, definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and such are also up to date. Also going to move this into the PR notes instead of huge comment here.

@TomNicholas
Member

Let me know when you would like a review of this @neilSchroeder !

@neilSchroeder neilSchroeder marked this pull request as ready for review November 3, 2025 22:22
@neilSchroeder
Author

@TomNicholas I think it's ready for a review.

Member

@TomNicholas TomNicholas left a comment

Thanks for working on this @neilSchroeder ! I mostly have a bunch of small gripes 😁

Comment on lines +79 to +82
    # List all keys under the array prefix, filtering out metadata files
    prefix_keys = [(x,) async for x in zarr_array.store.list_prefix(prefix)]
    if not prefix_keys:
        return {}
Member

There's an important subtlety here that's worth leaving a comment about: The reason we don't just generate the names of the keys from the chunk grid is because zarr chunks are allowed to be missing, and they are allowed to be missing in our VZ manifests too. It might actually be worth adding a test for this case - the case that there are some chunks but some are missing - VZ should return the fill_value for any missing chunks.
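
A small sketch of the partially populated case, using plain dictionaries in place of a real store and manifest. The helper names and the example paths are hypothetical; the point is only that the manifest is built from the keys that were listed, and anything absent resolves to the fill value.

```python
from itertools import product


def expected_chunk_keys(grid_shape):
    """Every dot-separated key the chunk grid could contain."""
    return {
        ".".join(str(i) for i in idx)
        for idx in product(*(range(n) for n in grid_shape))
    }


def resolve_chunk(manifest, key, fill_value):
    """Return the chunk location, or the fill value when the chunk is missing."""
    return manifest.get(key, fill_value)


# Built from listing the store: only two of the four chunks were ever written.
listed = {"0.0": "store/air/0.0", "1.1": "store/air/1.1"}

missing = sorted(expected_chunk_keys((2, 2)) - set(listed))
present = resolve_chunk(listed, "0.0", fill_value=-1)
absent = resolve_chunk(listed, "0.1", fill_value=-1)
```

Generating keys from the chunk grid instead would have produced four manifest entries, two of which point at files that do not exist.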

Author

I think I've added a test that covers what you're getting at, but it will need some review to make sure I've understood correctly. As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?

Member

> I think I've added a test that covers what you're getting at,

I see there are tests for the case that there are no chunks at all in the array, but I don't see a test for the case that there are some chunks but some are missing. Maybe that distinction is not important though.

> As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?

I meant add a comment to the file.

Author

I just added the test and haven't pushed it yet. Sorry.

And I'll add a comment to the file.

@codecov

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 99.13793% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.31%. Comparing base (cb2912e) to head (a4a271f).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| virtualizarr/parsers/zarr.py | 99.13% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #822      +/-   ##
==========================================
+ Coverage   87.71%   88.31%   +0.60%     
==========================================
  Files          35       35              
  Lines        1880     1968      +88     
==========================================
+ Hits         1649     1738      +89     
+ Misses        231      230       -1     
| Files with missing lines | Coverage Δ |
|---|---|
| virtualizarr/parsers/zarr.py | 99.33% <99.13%> (+2.55%) ⬆️ |

Development

Successfully merging this pull request may close these issues.

Virtualize Native Zarr V2 format

2 participants