Remove species_ prefix dependency from image path #383
base: master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #383 +/- ##
========================================
+ Coverage 82.8% 82.9% +0.1%
========================================
Files 37 37
Lines 3191 3194 +3
========================================
+ Hits 2645 2651 +6
+ Misses 546 543 -3
What are the implications of just doing this on save versus ahead of when training starts? Does this mean you couldn't resume training a model? Do we need to the
Good questions.

Re implications, just reasoning through: label inputs and outputs are the important bits. On first train, input labels have no prefixes. There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case. Predicting from that saved model currently produces labels with prefixes, and these labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Re resuming training, my current understanding: we take the checkpoint and the new labels from the user. We add the species prefix, then remove (the first instance of) it to check for label matches between the passed labels and the checkpoint hparam species. If we had stripped the prefix on checkpoint save, the user could pass labels without the prefix and they'd match. If we don't strip the prefix on save (which we currently don't), the user labels would only match if they included the prefix.

Re removing the prefix dependency entirely: I've found three places where this dependency still exists on the image path:
@pjbull I think this PR is an improvement over the status quo, but it's kinda complicated, so I'm not totally confident on that. Removing this complexity looks straightforward, but potentially unnecessary. Thoughts?
Yep, this is the crux of the issue we want fixed.
They could be displayed in the metrics/logs, so we want those to match the final state.
I think I'm missing a step here. Why does
Overall, I'd take a more complex change to not add the prefix
That's why it took me a while to track down. It's here in the pandas code. Here's a little test:
👍
Ah, this is a Series vs. DataFrame gotcha:
In [4]: pd.get_dummies(pd.Series(list("aabbabab")))
Out[4]:
a b
0 True False
1 True False
2 False True
3 False True
4 True False
5 False True
6 True False
7 False True
In [5]: pd.get_dummies(pd.DataFrame(list("aabbabab")))
Out[5]:
0_a 0_b
0 True False
1 True False
2 False True
3 False True
4 True False
5 False True
6 True False
7 False True
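To tie the gotcha back to the `species_` prefix, here is a minimal sketch. The column name `species` and the toy labels are assumptions for illustration, not taken from the zamba code. With a DataFrame, `pd.get_dummies` prefixes each dummy column with the source column name; with a Series, it does not.

```python
import pandas as pd

# Hypothetical label table; "species" is an assumed column name.
labels = pd.DataFrame({"species": list("aabbabab")})

# DataFrame input: the source column name becomes the prefix.
from_frame = pd.get_dummies(labels)

# Series input: the dummy columns are just the category values.
from_series = pd.get_dummies(labels["species"])

print(list(from_frame.columns))   # ['species_a', 'species_b']
print(list(from_series.columns))  # ['a', 'b']
```

So one-hot encoding the label Series directly is one way to avoid introducing the prefix in the first place.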
I don't think these failures have anything to do with this PR. Might be version updates or something?
e898025 to ad25244
Pull Request Overview
This PR aims to remove the hard dependency on the "species_" prefix from the image path and related label processing. Key changes include:
- Updating the make_split function in zamba/models/config.py to support an optional "species_in_label_order" flag and using removeprefix.
- Revising how class labels and one-hot encodings are processed in zamba/images/config.py, now employing a LabelEncoder.
- Modifying the weight calculation and test indexing in zamba/images/manager.py and tests/test_image_file_handling.py respectively.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
File | Description |
---|---|
zamba/models/config.py | Updated species handling logic using removeprefix and loop changes. |
zamba/images/manager.py | Changed weight computation to rely on the label column directly. |
zamba/images/config.py | Revised label preprocessing via LabelEncoder and one-hot encoding. |
tests/test_image_file_handling.py | Updated indexing for clearer and more robust row access. |
Comments suppressed due to low confidence (2)
zamba/models/config.py:682
- Ensure that the project's minimum Python version supports removeprefix (introduced in Python 3.9) to avoid compatibility issues.
k.removeprefix("species_"): v
zamba/images/config.py:425
- Consider whether the removal of reset_index (previously used before assignment to values["labels"]) is intentional, as downstream code may expect a continuous integer index.
values["labels"] = labels
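On the `removeprefix` note for zamba/models/config.py above, a small sketch of how `str.removeprefix` (Python 3.9+) behaves compared with `str.replace`. The dictionary keys and values are made up for illustration. `removeprefix` strips at most one leading occurrence and leaves keys without the prefix untouched.

```python
# Hypothetical mapping with mixed keys; values are arbitrary.
scores = {"species_antelope": 3, "species_species_x": 1, "blank": 2}

# removeprefix strips a single leading "species_" and ignores keys without it.
stripped = {k.removeprefix("species_"): v for k, v in scores.items()}
print(stripped)  # {'antelope': 3, 'species_x': 1, 'blank': 2}

# str.replace removes every occurrence, which would mangle such a label.
replaced = "species_species_x".replace("species_", "")
print(replaced)  # x
```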
A couple little tweaks.
zamba/images/manager.py
Outdated
classes = labels_df.columns.values
class_weights = compute_class_weight("balanced", classes=classes, y=y_array)
classes = split.label.unique()
classes.sort()
Is this function called at a place where we can pass the ordered labels as an additional param rather than assuming them here from the sort order?
Changed. The concern is that the split won't have all the labels?
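To make that concern concrete, a rough sketch with made-up labels. `balanced_weights` is a hypothetical helper, not zamba or scikit-learn code. It uses the usual "balanced" heuristic, n_samples / (n_classes * count), and assumes absent classes should get `None` so the caller has to decide how to handle them.

```python
from collections import Counter

# Hypothetical full label set, in the order the model outputs them.
all_labels = ["antelope", "blank", "zebra"]
# A split that happens to contain no "blank" images.
split_labels = ["zebra", "antelope", "zebra", "zebra"]

# Inferring classes from the split drops "blank" and shifts the index of "zebra".
inferred = sorted(set(split_labels))
print(inferred.index("zebra"), all_labels.index("zebra"))  # 1 2

def balanced_weights(labels, ordered_classes):
    """Balanced weight per class, aligned to ordered_classes (hypothetical helper)."""
    counts = Counter(labels)
    n_present = len(counts)  # number of classes actually seen in the split
    return [
        len(labels) / (n_present * counts[c]) if counts[c] else None
        for c in ordered_classes
    ]

weights = balanced_weights(split_labels, all_labels)
print(weights)
```

Passing the ordered labels in keeps the weights aligned with the model's output indices even when a split is missing a class.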
Related to https://github.com/drivendataorg/zamba-web/issues/512