DataFrame sample #48

adrianlut · 2021-06-09T18:14:28Z

Issue #, if available: No issue number available

Description of changes:

Added DataFrame.sample() to the list of supported pandas functions.

I also added it to the README and added a test. This PR requires #38: the test only works correctly with the fix from #38, and it also verifies that the fix from #38 works correctly for unary selections.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov-commenter · 2021-06-10T08:08:07Z

Codecov Report

Merging #48 (d9812e6) into master (7d6137d) will decrease coverage by 0.01%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master      #48      +/-   ##
==========================================
- Coverage   96.22%   96.21%   -0.02%     
==========================================
  Files          33       33              
  Lines        2145     2165      +20     
==========================================
+ Hits         2064     2083      +19     
- Misses         81       82       +1

| Impacted Files | Coverage Δ |
|---|---|
| mlinspect/monkeypatching/_patch_pandas.py | `96.37% <94.44%> (-0.16%)` ⬇️ |
| mlinspect/backends/_iter_creation.py | `100.00% <100.00%> (ø)` |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d6137d...d9812e6. Read the comment docs.

stefan-grafberger

Thanks for your work; this is great!

I think it's awesome that you're able to add support for a new API function. Of course, there's still a lot of room for improvement to make this process even easier in the future, but I'm really happy that this part of the codebase is now understandable enough for others to make changes. :-)

stefan-grafberger · 2021-06-10T08:06:07Z

mlinspect/monkeypatching/README.md

 | `('pandas.core.frame', '__getitem__')`, arg type: strings | Projection|
 | `('pandas.core.frame', '__getitem__')`, arg type: series | Selection |
 | `('pandas.core.frame', 'dropna')` | Selection      |
+| `('pandas.core.frame', 'sample')` | Selection      |


I think we shouldn't use the selection operator for this but introduce a new one that captures its semantics better. Something along the lines of OperatorTypes.RESAMPLE, what do you think?

The existing inspections and checks then also need a minor update to handle this new operator type. We should also add new tests for the inspections and checks, where it makes sense, to verify that the new behavior works. That's mainly NoBiasIntroducedFor and HistogramForColumns. The tests for those are in test/inspections/test_histogram_for_columns.py and test/check/test_no_bias_introduced_for.py.

Would you be willing to add that also?
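To make the proposal a bit more concrete, here is a rough sketch of what a new operator type could look like and how an inspection might react to it. Everything in it is an assumption for illustration: the enum is a stand-in and not mlinspect's actual definition, `RESAMPLE` is only the name floated above, and `ROW_CHANGING_OPERATORS` and `changes_rows` are hypothetical helpers.

```python
from enum import Enum


class OperatorType(Enum):
    """Illustrative stand-in, not mlinspect's actual enum."""
    PROJECTION = "Projection"
    SELECTION = "Selection"
    RESAMPLE = "Resample"  # hypothetical new member proposed in this review


# Hypothetical: operator types after which an inspection that tracks
# row-level statistics would have to recompute its results.
ROW_CHANGING_OPERATORS = {OperatorType.SELECTION, OperatorType.RESAMPLE}


def changes_rows(operator_type):
    """Return True if the operator can drop or duplicate rows."""
    return operator_type in ROW_CHANGING_OPERATORS


print(changes_rows(OperatorType.RESAMPLE))    # True
print(changes_rows(OperatorType.PROJECTION))  # False
```

With a set like this, handling the new type in an inspection or check would come down to one membership test rather than a separate branch per operator.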

I think that the selection operator fits the sampling operation. After all, it (randomly) selects a subset of rows from the DataFrame, and the iterator creation ensures that inspections do not have to deal with the row order.

However, I know that you implemented some inspections that assume constant row order and I can therefore understand the idea of creating a new operator type for selections that do not preserve order.

I would also suggest thinking about methods (properties) like loc, iloc, and sort_values that can change the row order without selecting. I don't know if resample is the best word to use.
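A small pandas sketch of what "changing the row order without selecting" means here. The frame is borrowed from the test snippet in this PR; the label permutation passed to `loc` and the `random_state` value are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame([0, 2, 4, 5, 10, 15], columns=['A'])

# sort_values keeps every row but returns them in a new order.
reordered = df.sort_values('A', ascending=False)
assert list(reordered.index) == [5, 4, 3, 2, 1, 0]
assert sorted(reordered.index) == list(df.index)

# loc with a permuted list of labels also reorders without dropping rows.
permuted = df.loc[[2, 0, 1, 5, 4, 3]]
assert list(permuted.index) == [2, 0, 1, 5, 4, 3]
assert sorted(permuted.index) == list(df.index)

# sample(frac=1) without replacement is a shuffle: same rows, random order.
shuffled = df.sample(frac=1, random_state=0)
assert sorted(shuffled.index) == list(df.index)
```

In all three cases the set of index labels is unchanged, so an operator type meant to flag "order may have changed" would cover more than sampling.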

I just noticed that the frac option is allowed to be bigger than 1. Then it obviously isn't a selection anymore. So yes, adding a new OperatorType seems to be a good idea. But not for this evening.
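The `frac > 1` case can be checked directly in pandas. This is a minimal sketch, assuming a pandas version that supports upsampling through `frac` (1.0 or later, as far as I know); the frame is borrowed from the test snippet in this PR and the `random_state` value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame([0, 2, 4, 5, 10, 15], columns=['A'])

# frac <= 1 without replacement: every row appears at most once, so the
# result is a (shuffled) subset of the input, i.e. a selection.
subset = df.sample(frac=0.5, random_state=42)
assert len(subset) == 3
assert subset.index.is_unique

# frac > 1 needs replace=True and returns more rows than the input,
# so at least one input row must appear more than once.
upsampled = df.sample(frac=2, replace=True, random_state=42)
assert len(upsampled) == 12
assert not upsampled.index.is_unique

# Without replace=True, pandas rejects frac > 1 with a ValueError.
try:
    df.sample(frac=2)
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```

Sampling with `replace=True` can duplicate rows even when `frac` is at most 1, so whether the result is a plain selection depends on both parameters.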

Thanks for your comment! Yes, upsampling with frac > 1 was the main thing I was concerned about. I guess if loc and iloc are used for selecting rows, then OperatorType.SELECTION would be appropriate. Do you think OperatorType.RESAMPLE would be alright here, then? Do you have another naming suggestion? In this context, the dataframe algebra presented in this paper is also very interesting, if that kind of stuff interests you.

Very good decision to stop working at this time of the day :-) If you decide adding a new OperatorType is too much work or if you have any questions, just let me know. Thanks for addressing all the other code review comments!

mlinspect/monkeypatching/_patch_pandas.py

stefan-grafberger · 2021-06-10T08:21:56Z

test/monkeypatching/test_patch_pandas.py

+                                   DagNodeDetails(None, ['A']),
+                                   OptionalCodeInfo(CodeReference(3, 5, 3, 54),
+                                                    "pd.DataFrame([0, 2, 4, 5, 10, 15], columns=['A'])"))
+    expected_select = DagNode(1,


This variable name should be updated

Adrian Lutsch added 4 commits June 9, 2021 16:55

Bug fix annotation order after selections and joins

1497719

Corrected train_test_split test

c2a0035

Added DataFrame.sample as supported method

045e6ce

Added sample to the README

d9812e6

stefan-grafberger requested changes Jun 10, 2021

Improved DataFrame.sample DagNode description

f0dc775
