Skip to content

Conversation

jdcc
Copy link
Contributor

@jdcc jdcc commented May 5, 2025

Copy link
Contributor

github-actions bot commented May 5, 2025

Copy link

codecov bot commented May 5, 2025

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.9%. Comparing base (ce5d28e) to head (37acdd8).

Files with missing lines Patch % Lines
zamba/images/manager.py 50.0% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #383     +/-   ##
========================================
+ Coverage    82.8%   82.9%   +0.1%     
========================================
  Files          37      37             
  Lines        3191    3194      +3     
========================================
+ Hits         2645    2651      +6     
+ Misses        546     543      -3     
Files with missing lines Coverage Δ
zamba/images/config.py 92.4% <100.0%> (+<0.1%) ⬆️
zamba/models/config.py 96.2% <100.0%> (+<0.1%) ⬆️
zamba/images/manager.py 83.4% <50.0%> (+1.6%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pjbull
Copy link
Member

pjbull commented May 7, 2025

What are the implications of just doing this on save versus ahead of when training starts? Does this mean you couldn't resume training a model? Do we need to the species_ addition at all in the image path if we are already tracking the species columns?

@jdcc
Copy link
Contributor Author

jdcc commented May 7, 2025

Good questions.

Re implications, just reasoning through: Label inputs and outputs are the important bits. On first train, input labels have no prefixes. There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case. Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Re resuming training, my current understanding: We take the checkpoint and the new labels from the user. We add the the species prefix then remove (the first instance of) it to check for label matches between the passed labels and the checkpoint hparam species. If we had stripped the prefix on checkpoint save, the user could pass labels without the prefix and they'd match. If we don't strip the prefix on save (which we're currently aren't), the user labels would only match if they included the prefix.

Re removing the prefix dependency entirely: I've found three places where this dependency still exists on the image path:

@pjbull I think this PR means an improvement over status quo, but it's kinda complicated, so I'm not totally confident on that. Removing this complexity looks straightforward, but potentially unnecessary. Thoughts?

@jdcc jdcc marked this pull request as ready for review May 7, 2025 14:29
@pjbull
Copy link
Member

pjbull commented May 7, 2025

Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Yep, this is the crux of the issue we want fixed.

There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case.

They could be displayed in the metrics/logs, so we want those to match the final state.

We would also change the training config that introduces the prefix.

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

@jdcc
Copy link
Contributor Author

jdcc commented May 8, 2025

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

That's why it took me a while to track down. It's here in the pandas code. Here's a little test:

image

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

👍

@pjbull
Copy link
Member

pjbull commented May 8, 2025

Ah, this is a series vs. DataFrame gotcha:

In [4]: pd.get_dummies(pd.Series(list("aabbabab")))
Out[4]: 
       a      b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

In [5]: pd.get_dummies(pd.DataFrame(list("aabbabab")))
Out[5]: 
     0_a    0_b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

@jdcc jdcc changed the title remove species_ prefix on model save Remove species_ prefix dependency from image path May 9, 2025
@jdcc
Copy link
Contributor Author

jdcc commented May 12, 2025

I don't think these failures have anything to do with this PR. Might be version updates or something?

@jdcc jdcc force-pushed the 512-jdcc-species-names branch from e898025 to ad25244 Compare May 12, 2025 18:08
@jdcc jdcc requested a review from Copilot May 12, 2025 18:25
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR aims to remove the hard dependency on the "species_" prefix from the image path and related label processing. Key changes include:

  • Updating the make_split function in zamba/models/config.py to support an optional "species_in_label_order" flag and using removeprefix.
  • Revising how class labels and one-hot encodings are processed in zamba/images/config.py, now employing a LabelEncoder.
  • Modifying the weight calculation and test indexing in zamba/images/manager.py and tests/test_image_file_handling.py respectively.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
zamba/models/config.py Updated species handling logic using removeprefix and loop changes.
zamba/images/manager.py Changed weight computation to rely on the label column directly.
zamba/images/config.py Revised label preprocessing via LabelEncoder and one-hot encoding.
tests/test_image_file_handling.py Updated indexing for clearer and more robust row access.
Comments suppressed due to low confidence (2)

zamba/models/config.py:682

  • Ensure that the project's minimum Python version supports removeprefix (introduced in Python 3.9) to avoid compatibility issues.
k.removeprefix("species_"): v

zamba/images/config.py:425

  • Consider whether the removal of reset_index (previously used before assignment to values["labels"]) is intentional, as downstream code may expect a continuous integer index.
values["labels"] = labels

@jdcc jdcc requested a review from pjbull May 12, 2025 18:36
Copy link
Member

@pjbull pjbull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple little tweaks.

classes = labels_df.columns.values
class_weights = compute_class_weight("balanced", classes=classes, y=y_array)
classes = split.label.unique()
classes.sort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function is called at a place where we can pass the ordered labels as an additional param rather than assuming them here with the sort order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. The concern is that the split won't have all the labels?

@jdcc jdcc requested a review from pjbull May 14, 2025 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants