-
Notifications
You must be signed in to change notification settings - Fork 17
DataFrame sample #48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
DataFrame sample #48
Conversation
Codecov Report
@@ Coverage Diff @@
## master #48 +/- ##
==========================================
- Coverage 96.22% 96.21% -0.02%
==========================================
Files 33 33
Lines 2145 2165 +20
==========================================
+ Hits 2064 2083 +19
- Misses 81 82 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work; this is great!
I think it's awesome that you're able to add support for a new API function. Of course, there's still a lot of room for improvement to make this process even easier in the future, but I'm really happy that this part of the codebase is now understable enough for others to make changes. :-)
| `('pandas.core.frame', '__getitem__')`, arg type: strings | Projection| | ||
| `('pandas.core.frame', '__getitem__')`, arg type: series | Selection | | ||
| `('pandas.core.frame', 'dropna')` | Selection | | ||
| `('pandas.core.frame', 'sample')` | Selection | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't use the selection operator for this but introduce a new one that captures its semantics better. Something along the lines of OperatorTypes.RESAMPLE
, what do you think?
The existing inspections and checks need a minor update then also to handle this new operator type. And then we should add new tests for the inspections and checks where it makes sense to check that the new behavior works. That's mainly NoBiasIntroducedFor
and HistogramForColumns
. The tests for that are in test/inspections/test_histogram_for_columns.py
and test/check/test_no_bias_introduced_for.py
.
Would you be willing to add that also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the selection operator is fitting the sampling operation. After all, it (randomly) selects a subset of rows from the DataFrame and the iterator creation ensures that inspections do not have to deal with the row order.
However, I know that you implemented some inspections that assume constant row order and I can therefore understand the idea of creating a new operator type for selections that do not preserve order.
I would also suggest thinking about methods (properties) like loc
, iloc
, and sort_values that can change the row order without selecting. I don't know if resample is the best word to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that the frac
option is allowed to be bigger than 1. Than it obviously isn't a selection anymore. So yes, adding a new OperatorType seems to be a good idea. But not for this evening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment! Yes, upsampling with frac > 1
was the main thing I was concerned about. I guess if loc
and iloc
are used for selecting rows then OperatorType.SELECTION
would be appropriate. Do you think OperatorType.RESAMPLE
would be alright then here? Do you have another naming suggestion? In this context, the dataframe algera presented in this paper is also very interesting if that kind of stuff is interesting for you.
Very good decision to stop working at this time of the day :-) If you decide adding a new OperatorType
is too much work or if you have any questions, just let me know. Thanks for addressing all the other code review comments!
DagNodeDetails(None, ['A']), | ||
OptionalCodeInfo(CodeReference(3, 5, 3, 54), | ||
"pd.DataFrame([0, 2, 4, 5, 10, 15], columns=['A'])")) | ||
expected_select = DagNode(1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable name should be updated
Issue #, if available: No issue number available
Description of changes:
Added DataFrame.sample() to the list of supported pandas functions.
Furthermore added it ot the README and added a test. This PR requires #38. The test implemented only works correctly with the fix from #38. It also verifies that the fix from #38 works correctly for unary selections.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.