Remove species_ prefix dependency from image path #383

jdcc · 2025-05-05T20:34:51Z

Related to https://github.com/drivendataorg/zamba-web/issues/512

github-actions · 2025-05-05T20:35:21Z

🚀 Deployed on https://deploy-preview-383--silly-keller-664934.netlify.app

codecov · 2025-05-05T20:46:08Z

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.9%. Comparing base (ce5d28e) to head (37acdd8).

Files with missing lines	Patch %	Lines
zamba/images/manager.py	50.0%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #383     +/-   ##
========================================
+ Coverage    82.8%   82.9%   +0.1%     
========================================
  Files          37      37             
  Lines        3191    3194      +3     
========================================
+ Hits         2645    2651      +6     
+ Misses        546     543      -3

Files with missing lines	Coverage Δ
zamba/images/config.py	`92.4% <100.0%> (+<0.1%)`	⬆️
zamba/models/config.py	`96.2% <100.0%> (+<0.1%)`	⬆️
zamba/images/manager.py	`83.4% <50.0%> (+1.6%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pjbull · 2025-05-07T05:50:43Z

What are the implications of just doing this on save versus ahead of when training starts? Does this mean you couldn't resume training a model? Do we need to the species_ addition at all in the image path if we are already tracking the species columns?

jdcc · 2025-05-07T14:29:27Z

Good questions.

Re implications, just reasoning through: Label inputs and outputs are the important bits. On first train, input labels have no prefixes. There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case. Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Re resuming training, my current understanding: We take the checkpoint and the new labels from the user. We add the the species prefix then remove (the first instance of) it to check for label matches between the passed labels and the checkpoint hparam species. If we had stripped the prefix on checkpoint save, the user could pass labels without the prefix and they'd match. If we don't strip the prefix on save (which we're currently aren't), the user labels would only match if they included the prefix.

Re removing the prefix dependency entirely: I've found three places where this dependency still exists on the image path:

Calculating the class weights. Easily changed.
Model instantiation, called here and here, also easily changed.
Dataset splitting
We would also change the training config that introduces the prefix.

@pjbull I think this PR means an improvement over status quo, but it's kinda complicated, so I'm not totally confident on that. Removing this complexity looks straightforward, but potentially unnecessary. Thoughts?

pjbull · 2025-05-07T22:57:04Z

Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Yep, this is the crux of the issue we want fixed.

There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case.

They could be displayed in the metrics/logs, so we want those to match the final state.

We would also change the training config that introduces the prefix.

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

jdcc · 2025-05-08T14:01:02Z

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

That's why it took me a while to track down. It's here in the pandas code. Here's a little test:

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

👍

pjbull · 2025-05-08T15:40:14Z

Ah, this is a series vs. DataFrame gotcha:

In [4]: pd.get_dummies(pd.Series(list("aabbabab")))
Out[4]: 
       a      b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

In [5]: pd.get_dummies(pd.DataFrame(list("aabbabab")))
Out[5]: 
     0_a    0_b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

jdcc · 2025-05-12T17:06:54Z

I don't think these failures have anything to do with this PR. Might be version updates or something?

Copilot

Pull Request Overview

This PR aims to remove the hard dependency on the "species_" prefix from the image path and related label processing. Key changes include:

Updating the make_split function in zamba/models/config.py to support an optional "species_in_label_order" flag and using removeprefix.
Revising how class labels and one-hot encodings are processed in zamba/images/config.py, now employing a LabelEncoder.
Modifying the weight calculation and test indexing in zamba/images/manager.py and tests/test_image_file_handling.py respectively.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
zamba/models/config.py	Updated species handling logic using removeprefix and loop changes.
zamba/images/manager.py	Changed weight computation to rely on the label column directly.
zamba/images/config.py	Revised label preprocessing via LabelEncoder and one-hot encoding.
tests/test_image_file_handling.py	Updated indexing for clearer and more robust row access.

Comments suppressed due to low confidence (2)

zamba/models/config.py:682

Ensure that the project's minimum Python version supports removeprefix (introduced in Python 3.9) to avoid compatibility issues.

k.removeprefix("species_"): v

zamba/images/config.py:425

Consider whether the removal of reset_index (previously used before assignment to values["labels"]) is intentional, as downstream code may expect a continuous integer index.

values["labels"] = labels

pjbull

A couple little tweaks.

zamba/models/config.py

pjbull · 2025-05-13T00:51:50Z

zamba/images/manager.py

-    classes = labels_df.columns.values
-    class_weights = compute_class_weight("balanced", classes=classes, y=y_array)
+    classes = split.label.unique()
+    classes.sort()


Is this function is called at a place where we can pass the ordered labels as an additional param rather than assuming them here with the sort order?

Changed. The concern is that the split won't have all the labels?

zamba/images/config.py

jdcc marked this pull request as ready for review May 7, 2025 14:29

jdcc changed the title ~~remove species_ prefix on model save~~ Remove species_ prefix dependency from image path May 9, 2025

jdcc added 7 commits May 12, 2025 14:08

remove species_ prefix on model save

2889b19

split without species prefix

c6177ce

handle class weights without prefix

3ab47b1

fix linting and some tests

94b1cd0

still remove prefix when on video path

e5ce072

don't rename in save

e640344

don't remove prefix on save

ad25244

jdcc force-pushed the 512-jdcc-species-names branch from e898025 to ad25244 Compare May 12, 2025 18:08

jdcc requested a review from Copilot May 12, 2025 18:25

Copilot AI reviewed May 12, 2025

View reviewed changes

jdcc requested a review from pjbull May 12, 2025 18:36

pjbull requested changes May 13, 2025

View reviewed changes

small fixes

37acdd8

jdcc requested a review from pjbull May 14, 2025 20:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove species_ prefix dependency from image path #383

Remove species_ prefix dependency from image path #383

Uh oh!

jdcc commented May 5, 2025

Uh oh!

github-actions bot commented May 5, 2025 •

edited

Loading

Uh oh!

codecov bot commented May 5, 2025 •

edited

Loading

Uh oh!

pjbull commented May 7, 2025

Uh oh!

jdcc commented May 7, 2025 •

edited

Loading

Uh oh!

pjbull commented May 7, 2025

Uh oh!

jdcc commented May 8, 2025

Uh oh!

pjbull commented May 8, 2025

Uh oh!

jdcc commented May 12, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

pjbull left a comment

Uh oh!

Uh oh!

pjbull May 13, 2025

Uh oh!

jdcc May 14, 2025

Uh oh!

Uh oh!

Uh oh!

Remove species_ prefix dependency from image path #383

Are you sure you want to change the base?

Remove species_ prefix dependency from image path #383

Uh oh!

Conversation

jdcc commented May 5, 2025

Uh oh!

github-actions bot commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pjbull commented May 7, 2025

Uh oh!

jdcc commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pjbull commented May 7, 2025

Uh oh!

jdcc commented May 8, 2025

Uh oh!

pjbull commented May 8, 2025

Uh oh!

jdcc commented May 12, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

pjbull left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pjbull May 13, 2025

Choose a reason for hiding this comment

Uh oh!

jdcc May 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented May 5, 2025 •

edited

Loading

codecov bot commented May 5, 2025 •

edited

Loading

jdcc commented May 7, 2025 •

edited

Loading