Add BaselineRankMultiFeature #871
Conversation
if not isinstance(rule, dict):
    raise ValueError('Rules for BaselineRankMultiFeature must be of type dict')
if not rule.keys() >= REQUIRED_KEYS:
    raise ValueError(f'BaselineRankMultiFeature rule "{rule}" missing one or more reuired keys ({REQUIRED_KEYS})')
typo in error message ("reuired" should be "required")
fixed -- thanks!
ranker.fit(x=self.data["X_train"], y=self.data["y_train"])
results = ranker.predict_proba(self.data["X_test"])
if direction_value:
    expected_ranks = [6, 1, 3, 0, 5, 2, 4, 5]
I wonder if a clearer way to express this test (and one less prone to rewriting if the test fixture has to be modified for any reason) would be to order the rows by their rank and assert that each one's value is less than or equal to (or greater than or equal to, depending on the sort direction) the previous one.
I have mixed thoughts on this approach of duplicating logic in the test from the code under test; after all, if it's been implemented incorrectly in the code under test, who's to say the test doesn't have the same bad logic and thus erroneously pass? But I'm guessing in this case you wouldn't be able to copy and paste code from the code under test, and it probably wouldn't have that problem.
I'm not sure this is a big issue here. In general it's good to go for low maintenance tests that aren't beholden to what the test fixtures specifically look like, especially when the test fixtures are shared between different tests. You could in the future end up adding a new baseline class that needs more entropy than this fixture provides, and the data changes you make could require all these tests to be updated. What is the probability of needing to make such a change in the future? I don't know. But in general this type of thing happens a lot, and it makes the tests seem like a burden so it's good to minimize when we can.
As far as the data goes, I might be misunderstanding the logic used; shouldn't the calls with inverse direction values but the same data have inverse expected ranks?
Thanks, @thcrock! That's a good point -- I just added a helper function to check that the returned scores align to the expected ordering without having to reproduce the logic from the underlying function itself. Let me know if that looks better to you or if you think it could be cleaned up a bit more?
I think the calls with the different direction values do get the inverse expected ranks, but let me know if I'm missing something. For instance:
if direction_value:
    expected_ranks = [6, 1, 3, 0, 5, 2, 4, 5]
else:
    expected_ranks = [0, 5, 3, 6, 1, 4, 2, 1]
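For illustration, a minimal sketch of the kind of order-checking helper described above; the name, signature, and the assumed meaning of descend are hypothetical (not necessarily what was added to the test suite), and it checks a single feature where the real class ranks on several:

import numpy as np


def assert_scores_follow_feature_order(scores, feature_values, descend=False):
    """Assert that rows ordered by score have monotonically ordered feature values.

    Rather than re-deriving the expected ranks, we only check that walking
    down the ranked list the feature value never moves in the wrong direction.
    """
    scores = np.asarray(scores, dtype=float)
    feature_values = np.asarray(feature_values, dtype=float)

    # Order the feature values from highest score to lowest score
    ranked = feature_values[np.argsort(-scores)]

    # Assumption: descend=True means higher feature values receive higher
    # scores, so the ranked values should be non-increasing; with
    # descend=False they should be non-decreasing.
    if descend:
        assert all(a >= b for a, b in zip(ranked, ranked[1:]))
    else:
        assert all(a <= b for a, b in zip(ranked, ranked[1:]))


# Toy usage: scores that put low feature values first when descend=False
assert_scores_follow_feature_order(
    scores=[0.9, 0.1, 0.5], feature_values=[1.0, 9.0, 4.0], descend=False
)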
No further thoughts beyond what @thcrock has said, but wondering if we should just deprecate the existing percentile-based ranking baseline.
@ecsalomon -- I'm fine either keeping or deprecating the existing percentile-based ranker.
Yeah, I mean, there was no real motivation for percentile being the method there. A deprecation warning might surface anyone who is actually using them as percentiles downstream somewhere. Digging in a bit more, I might actually argue for the generic scaler to be the thing that controls how scores from the rankers get scaled (i.e., they return whatever scale is applied to the input features) rather than the rankers applying their own scaling.
That makes sense about the percentiles, I can certainly add a deprecation warning here and direct people to the newer ranking method. On the scaler, how would the generic scaler work when you're ranking across multiple features? Even with one feature, if you're learning the scaling from the training set, you'll also run into cases where the test values are outside of [0,1] and we might not want to cap them for the purposes of simple ranking -- of course, the 0 to 1 range is a bit arbitrary here, but if we are scaling to something like that it's probably good to keep them within the range.
Yeah, a fair point about not artificially capping test ranks if they fall outside the range. I'm convinced I was wrong. :)
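To make that point concrete, a small standalone sketch (not code from this PR) of why a min-max scaler fit on the training set can push test-time values outside [0, 1]:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on training values only
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [35.0]])  # values outside the training range

scaler = MinMaxScaler()
scaler.fit(train)

print(scaler.transform(test))
# [[-0.25]
#  [ 1.25]]
# Test rows below the training minimum or above the training maximum land
# outside [0, 1]. Clipping them back into the range would tie them with the
# boundary values and discard ordering information at the extremes, which is
# what a simple rank-based baseline cares about most.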
@thcrock -- just pinging if you have other thoughts on the unit tests here? Also, @rayidghani -- other thoughts on the descend parameter name? I don't think …
Going ahead and merging this now, but we can revisit the parameter name and unit tests in future pull request(s) if we want. Closes #869
Building on #870, this PR adds BaselineRankMultiFeature to allow ranking by more than one feature as a baseline model (if we decide on a different name for the sort order parameter there, we'll want to be sure to rename it here as well for consistency).
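For a rough sense of the intended usage, a hedged sketch of how multi-feature ranking rules might be specified and validated; the key names 'feature' and 'low_value_high_score' are illustrative assumptions rather than names confirmed by this PR, though the validation mirrors the diff quoted above:

# Hypothetical rule list for a multi-feature ranking baseline. Check the
# class's REQUIRED_KEYS for the real key names; these are assumptions.
rules = [
    {"feature": "outstanding_violations", "low_value_high_score": False},
    {"feature": "days_since_last_inspection", "low_value_high_score": True},
]

# Each rule is validated as in the diff above: it must be a dict and must
# contain every required key, otherwise a ValueError is raised.
REQUIRED_KEYS = {"feature", "low_value_high_score"}
for rule in rules:
    if not isinstance(rule, dict):
        raise ValueError("Rules for BaselineRankMultiFeature must be of type dict")
    if not rule.keys() >= REQUIRED_KEYS:
        raise ValueError(
            f'BaselineRankMultiFeature rule "{rule}" missing one or more '
            f"required keys ({REQUIRED_KEYS})"
        )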