[FEATURE] Add stratified train_valid_split similar to sklearn.model_selection.train_test_split #933
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #933    +/-   ##
=========================================
- Coverage   89.98%   88.38%    -1.6%
=========================================
  Files          67       67
  Lines        6372     6296      -76
=========================================
- Hits         5734     5565     -169
- Misses        638      731      +93
Job PR-933/1 is complete.
Thank you for the contribution, @colinkyle. To move forward, let's:
- fix lint errors
- add a unittest for the new functionality
Thanks for your patience. I don't see any tests for the functions within gluonnlp.data.utils. Should I create a new file under the unittest folder, or is there already a file I should modify to add a test for train_valid_split?
Job PR-933/2 is complete.
@colinkyle let's put the test in tests/test_utils.py for now. Or if you prefer to create a new file for it, we can do that too.
Job PR-933/3 is complete.
Job PR-933/4 is complete.
LGTM. The PR will be merged once the mxnet website intersphinx is fixed.
# Excerpt under review: relabel arbitrary class labels to 0..n_classes-1 and count samples per class.
classes, digitized = np.unique(stratify, return_inverse=True)
n_classes = len(classes)
num_class = np.bincount(digitized)
One problem with using bincount is that len(num_class) != n_classes in some cases, e.g. labels = [0, 1, 2, 4], where 3 is missing.
I lifted that logic directly from sklearn's implementation, and I believe it handles that case: "digitized" contains the labels renumbered to start at zero (e.g. labels = [1, 2, 4, 4, 2, 1, 1, 1, 2, 2, 0, 0] gives digitized = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 0, 0]), so len(num_class) does equal n_classes.
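A quick numpy check of that relabeling behaviour (illustrative snippet, not part of the PR): np.unique with return_inverse=True maps arbitrary labels to the contiguous range 0..n_classes-1, so np.bincount on the inverse always has exactly n_classes entries, even when a label value such as 3 is missing from the original set.

import numpy as np

labels = np.array([1, 2, 4, 4, 2, 1, 1, 1, 2, 2, 0, 0])
classes, digitized = np.unique(labels, return_inverse=True)
print(classes)     # [0 1 2 4]
print(digitized)   # [1 2 3 3 2 1 1 1 2 2 0 0]
print(len(np.bincount(digitized)) == len(classes))  # True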
Is this going to move forward? I'm not sure where we are with the review.
@colinkyle It will be merged once it passes CI.
@szhengac thanks for the update.
@leezu the CI error is related to downloading a dataset. It seems we have seen it in another PR before?
I think we should just rebase and merge. |
Job PR-933/16 is complete.
Job PR-933/17 is complete.
Description
I added the ability to perform a stratified split in train_valid_split; a usage sketch is shown below.
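A hypothetical usage sketch (the stratify parameter name, its accepted input, and the return order are assumed from the description, not taken verbatim from the PR): split a labelled dataset so that the class proportions are preserved in both the train and validation parts, analogous to sklearn.model_selection.train_test_split(..., stratify=labels).

import gluonnlp as nlp

dataset = [('great movie', 1), ('terrible plot', 0), ('loved it', 1),
           ('not for me', 0), ('brilliant acting', 1), ('boring', 0)]
labels = [label for _, label in dataset]

# stratify is assumed here to accept per-sample labels; valid_ratio keeps
# the same meaning as in the existing train_valid_split.
train, valid = nlp.data.train_valid_split(dataset, valid_ratio=0.5,
                                          stratify=labels)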
Checklist
Essentials
Changes
Comments
Backwards compatible. The only edge case I can think of is if someone tries to use a float array to stratify their data and ends up with nonsensical results (illustrated below).
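For context, a small numpy illustration (not from the PR) of why continuous float targets make poor strata: almost every value is unique, so each sample lands in its own singleton "class" and the stratified split is effectively meaningless.

import numpy as np

y = np.random.rand(10)
classes, digitized = np.unique(y, return_inverse=True)
print(len(classes))  # ~10: roughly one stratum per sample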
cc @dmlc/gluon-nlp-team