Conversation

@shaycrk (Contributor) commented Nov 13, 2021

Building on #870, this PR adds BaselineRankMultiFeature to allow ranking by more than one feature as a baseline model. (If we decide on a different name for the sort-order parameter there, we'll want to be sure to rename it here as well for consistency.)
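For context, a hedged sketch of the kind of rule specification this implies -- the key names and feature names below are illustrative assumptions only, since REQUIRED_KEYS isn't shown in the excerpt and the sort-order parameter name is still under discussion in this PR:

    # Hypothetical example only; not taken from the PR.
    rules = [
        {"feature": "outstanding_violations", "descend": True},
        {"feature": "days_since_last_inspection", "descend": False},
    ]

The validation for the new class (excerpted below) checks that each rule is a dict containing a set of required keys: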

if not isinstance(rule, dict):
    raise ValueError('Rules for BaselineRankMultiFeature must be of type dict')
if not rule.keys() >= REQUIRED_KEYS:
    raise ValueError(f'BaselineRankMultiFeature rule "{rule}" missing one or more reuired keys ({REQUIRED_KEYS})')
Contributor:

typo in error message

Contributor Author:

fixed -- thanks!

ranker.fit(x=self.data["X_train"], y=self.data["y_train"])
results = ranker.predict_proba(self.data["X_test"])
if direction_value:
    expected_ranks = [6, 1, 3, 0, 5, 2, 4, 5]
Contributor:

I wonder if a clearer way to express this test (and one less prone to rewriting if the test fixture has to be modified for any reason) would be to order the rows by rank and assert that each one's value is less than or equal to (or greater than or equal to, depending on the direction) the value of the one before it.

I have mixed thoughts on this approach of duplicating logic in the test from the code under test; after all, if it's been implemented incorrectly in the code under test, who's to say the test doesn't have the same bad logic and thus erroneously pass? But I'm guessing in this case you wouldn't be able to copy and paste code from the code under test, and it probably wouldn't have that problem.

I'm not sure this is a big issue here. In general it's good to go for low maintenance tests that aren't beholden to what the test fixtures specifically look like, especially when the test fixtures are shared between different tests. You could in the future end up adding a new baseline class that needs more entropy than this fixture provides, and the data changes you make could require all these tests to be updated. What is the probability of needing to make such a change in the future? I don't know. But in general this type of thing happens a lot, and it makes the tests seem like a burden so it's good to minimize when we can.

As far as the data goes, I might be misunderstanding the logic used; shouldn't the calls with inverse direction values but the same data have inverse expected ranks?
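Something like this minimal sketch, say (the helper name and the single-feature, direction-flag setup are assumptions for illustration, not the PR's actual test code):

    import numpy as np

    def assert_rank_order_matches_feature(scores, feature_values, high_value_high_score=True):
        # Hypothetical helper: order rows from highest to lowest score, then check that
        # the ranking feature moves monotonically in the expected direction.
        order = np.argsort(scores)[::-1]                  # indices from highest to lowest score
        ordered = np.asarray(feature_values, dtype=float)[order]
        diffs = np.diff(ordered)
        if high_value_high_score:
            assert np.all(diffs <= 0), "feature values should never increase as scores fall"
        else:
            assert np.all(diffs >= 0), "feature values should never decrease as scores fall"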

Contributor Author:

Thanks, @thcrock! That's a good point -- I just added a helper function to check that the returned scores align to the expected ordering without having to reproduce the logic from the underlying function itself. Let me know if that looks better to you or if you think it could be cleaned up a bit more?

I think the calls with the different direction values do get the inverse expected ranks, but let me know if I'm missing something. For instance:

  if direction_value:
      expected_ranks = [6, 1, 3, 0, 5, 2, 4, 5]
  else:
      expected_ranks = [0, 5, 3, 6, 1, 4, 2, 1]

@not-the-fish (Contributor):

No further thoughts beyond what @thcrock has said, but I'm wondering if we should just deprecate PercentileRankOneFeature along with this, since it obviates the need for that ranker.

@shaycrk (Contributor Author) commented Nov 15, 2021

@ecsalomon -- I'm fine with either keeping or deprecating PercentileRankOneFeature. I agree this should fit most of the use cases for that, but unlike PercentileRankOneFeature, it doesn't attempt to ensure that the scores are interpretable as percentiles, just that they're correctly ordered, so we would be losing some functionality. I'm not sure how often that feature gets used, though.

@not-the-fish (Contributor):

Yeah, I mean, there was no real motivation for percentile being the method there. A deprecation warning might surface anyone who is actually using them as percentiles downstream somewhere.

Digging in a bit more, I might actually argue for the generic scaler to be the thing that controls how scores from the rankers get scaled (i.e., they return whatever scale is applied to the input features) rather than the rankers applying their own scaling.

@shaycrk (Contributor Author) commented Nov 15, 2021

That makes sense about the percentiles; I can certainly add a deprecation warning here and direct people to the newer ranking method.
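A minimal sketch of what that might look like (the class body and message wording are just assumptions, not the merged code):

    import warnings

    class PercentileRankOneFeature:  # hypothetical stand-in for the existing ranker
        def __init__(self, *args, **kwargs):
            warnings.warn(
                "PercentileRankOneFeature is deprecated; use BaselineRankMultiFeature instead.",
                DeprecationWarning,
                stacklevel=2,
            )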

On the scaler: how would the generic scaler work when you're ranking across multiple features? Even with one feature, if you're learning the scaling from the training set, you'll run into cases where the test values fall outside [0, 1], and we might not want to cap them for the purposes of simple ranking. Of course, the 0 to 1 range is a bit arbitrary here, but if we are scaling to something like that, it's probably good to keep the scores within the range.
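To illustrate the out-of-range point, a quick sketch using sklearn's MinMaxScaler as a stand-in for the generic scaler (that choice is just an assumption for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaler.fit(np.array([[0.0], [10.0]]))        # scaling learned from a 0..10 training range
    print(scaler.transform(np.array([[15.0]])))  # [[1.5]] -- an unseen test value lands outside [0, 1]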

@not-the-fish (Contributor):

Yeah, a fair point about not artificially capping test ranks if they fall outside the range. I'm convinced I was wrong. :)

@shaycrk (Contributor Author) commented Nov 19, 2021

@thcrock -- just pinging in case you have other thoughts on the unit tests here?

Also, @rayidghani -- other thoughts on the descend parameter name? I don't think low_value_high_score is great, but also don't have better ideas.

@shaycrk (Contributor Author) commented Dec 7, 2021

Going ahead and merging this now, but we can revisit the parameter name and unit tests in future pull request(s) if we want.

Closes #869
