-
Notifications
You must be signed in to change notification settings - Fork 51
Parse zarr v2 #822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Parse zarr v2 #822
Conversation
… V2 and V3 formats
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @neilSchroeder ! This is awesome that you're working on this!
However, I don't think you should need to make any changes to this file in order to add this feature.
IIUC you're altering the key-parsing logic inside ManifestStore to accommodate V2-like keys. But that's not the right place for that logic. The ManifestStore is explicitly a V3 store - it's key logic should not need to be changed. Instead, you need to alter the ZarrParser (or add a separate ZarrV2Parser) to do that mapping. The output of the Parser should not have any trace of what file format was parsed, zarr or otherwise. So you might need some v2-key-handling logic like this inside your parser, but not in manifests/store.py. Does that make sense? Am I understanding properly?
Changing this logic in the ManifestStore is also what's causing other, unrelated, tests to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this totally makes sense. Thanks for the feedback! I wasn't exactly sure where some of these fixes needed to go so I took a pretty naive test-based-development approach.
I'll think about this a little more carefully and see if I can't come up with a more elegant solution that avoids manipulating the stores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - shout if you have any questions / aren't sure about anything!
How
|
|
Let me know when you would like a review of this @neilSchroeder ! |
|
@TomNicholas I think it's ready for a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @neilSchroeder ! I mostly have a bunch of small gripes 😁
| # List all keys under the array prefix, filtering out metadata files | ||
| prefix_keys = [(x,) async for x in zarr_array.store.list_prefix(prefix)] | ||
| if not prefix_keys: | ||
| return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an important subtlety here that's worth leaving a comment about: The reason we don't just generate the names of the keys from the chunk grid is because zarr chunks are allowed to be missing, and they are allowed to be missing in our VZ manifests too. It might actually be worth adding a test for this case - the case that there are some chunks but some are missing - VZ should return the fill_value for any missing chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've added a test that covers what you're getting at, but it will need some review to make sure I've understood correctly. As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've added a test that covers what you're getting at,
I see there are tests for the case that there are no chunks at all in the array, but I don't see a test for the case that there are some chunks but some are missing. Maybe that distinction is not important though.
As for "worth leaving a comment": did you mean in your review here, or do you think it's important to actually add a comment in the file to raise awareness?
I meant add a comment to the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added the test and haven't pushed it yet. Sorry.
And I'll add a comment to the file.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #822 +/- ##
==========================================
+ Coverage 87.71% 88.31% +0.60%
==========================================
Files 35 35
Lines 1880 1968 +88
==========================================
+ Hits 1649 1738 +89
+ Misses 231 230 -1
🚀 New features to boost your workflow:
|
Checklist
docs/releases.mdHow
zarr.pyHandles Zarr V2 StoresZarrParsershould now support both Zarr V2 and V3 stores by normalizing V2 stores to appear as V3. This approach ensures that all parsers produce V3-compatible outputs, and confines modifications tozarr.py.V2 → V3 Normalization Strategy
The parser performs a two-part normalization:
1. Chunk Key Mapping (
get_chunk_mapping_prefix)For V2 arrays:
array_name/0,array_name/0.1.22. Metadata Conversion (
get_metadata())After converting V2 metadata to V3 using
_convert_array_metadata, we have to replace thechunk_key_encoding.V2ChunkKeyEncodingin the V3 metadatazarr/xarrayseesV2ChunkKeyEncoding, it requests chunks using V2-style paths:array/0DefaultChunkKeyEncoding,zarrrequests chunks using V3-style paths:array/c/0ManifestStore.get()expects V3-style paths and usesparse_manifest_index()to extract chunk coordinatesparse_manifest_index()requires the/c/component to correctly parse the pathAdditional metadata handling
_ARRAY_DIMENSIONSattribute or generated as{array_name}_dim_{i}zarr's standard V2→V3 migration utilitiesImplementation Notes
I'm not convinced I've done a particularly elegant implementation here, but adding another class for V2 parsing didn't seem like it would be particularly extensible. Very happy to hear thoughts on perhaps a better implementation.
@TomNicholas thank you very much for your feedback, it definitely helped me wrap my head around the right approach to take here.
Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and documentation are also up to date.