
Conversation

@colinkyle (Contributor) commented Sep 18, 2019

Description

I added the ability to perform a stratified split in train_valid_split.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage - I'm not too familiar with writing tests, but if you can point me to a tutorial I'd be happy to add them.
  • Code is well-documented

Changes

  • [x] stratify option for train_valid_split, plus test_utils.test_train_valid_split

Comments

Backwards compatible; the only edge case I can think of is if someone tries to use floats to stratify their data and ends up getting nonsensical results.
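For reference, a minimal usage sketch of the new option. The stratify keyword name is taken from this PR and valid_ratio from the existing train_valid_split API; the toy data and exact call form are illustrative assumptions, not code from the diff:

```python
import gluonnlp as nlp

# Toy labelled dataset: three classes with unequal counts.
texts = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
samples = list(zip(texts, labels))

# Stratified split: each class keeps roughly the same proportion in the
# training and validation subsets. stratify expects discrete class labels,
# not floats (see the edge case noted above).
train_set, valid_set = nlp.data.train_valid_split(
    samples, valid_ratio=0.3, stratify=labels)
```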

cc @dmlc/gluon-nlp-team

@colinkyle requested a review from a team as a code owner September 18, 2019 21:51
codecov bot commented Sep 18, 2019

Codecov Report

Merging #933 into master will decrease coverage by 1.59%.
The diff coverage is 88.54%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #933     +/-   ##
=========================================
- Coverage   89.98%   88.38%   -1.6%     
=========================================
  Files          67       67             
  Lines        6372     6296     -76     
=========================================
- Hits         5734     5565    -169     
- Misses        638      731     +93
Impacted Files Coverage Δ
src/gluonnlp/utils/parameter.py 83.92% <ø> (+2.57%) ⬆️
src/gluonnlp/data/utils.py 81.57% <ø> (ø) ⬆️
src/gluonnlp/utils/version.py 100% <ø> (ø) ⬆️
src/gluonnlp/model/seq2seq_encoder_decoder.py 80% <ø> (+4.61%) ⬆️
src/gluonnlp/utils/files.py 49.01% <0%> (ø) ⬆️
src/gluonnlp/model/highway.py 100% <100%> (ø) ⬆️
src/gluonnlp/model/language_model.py 98.49% <100%> (ø) ⬆️
src/gluonnlp/data/word_embedding_evaluation.py 96.94% <100%> (+4.55%) ⬆️
src/gluonnlp/data/batchify/__init__.py 100% <100%> (ø) ⬆️
src/gluonnlp/base.py 86.2% <100%> (ø) ⬆️
... and 32 more

@mli (Member) commented Sep 18, 2019

Job PR-933/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/1/index.html

@szha (Member) left a comment

Thank you for the contribution, @colinkyle. To move forward, let's:

  • fix lint errors
  • add a unittest for the new functionality

@colinkyle (Contributor, Author)

Thanks for your patience. I don't see any tests for the functions within gluonnlp.data.utils. Should I create a new file under the unittest folder, or is there already a file I should modify to add a test for train_valid_split?
Thanks!

@mli (Member) commented Sep 19, 2019

Job PR-933/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/2/index.html

@szha (Member) commented Sep 19, 2019

@colinkyle let's put the test in tests/test_utils.py for now. Or, if you prefer to create a new file for it, we can do that too.
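A rough sketch of what such a test in tests/test_utils.py might look like. The assertions, tolerances, and exact call form are illustrative assumptions, not the test that was eventually merged:

```python
import numpy as np
import gluonnlp as nlp

def test_train_valid_split_stratified():
    # 80 samples of class 0 and 20 of class 1.
    labels = [0] * 80 + [1] * 20
    dataset = [(i, label) for i, label in enumerate(labels)]

    train, valid = nlp.data.train_valid_split(
        dataset, valid_ratio=0.25, stratify=labels)

    # The split should cover the whole dataset, hold out roughly a quarter
    # of it, and preserve the 80/20 class ratio in the validation set.
    valid_labels = np.array([label for _, label in valid])
    assert len(train) + len(valid) == len(dataset)
    assert abs(len(valid) / len(dataset) - 0.25) < 0.05
    assert abs((valid_labels == 0).mean() - 0.8) < 0.1
```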

@mli (Member) commented Sep 19, 2019

Job PR-933/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/3/index.html

@mli (Member) commented Sep 19, 2019

Job PR-933/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/4/index.html

@szha (Member) left a comment

LGTM. The PR will be merged once the mxnet website intersphinx is fixed.


classes, digitized = np.unique(stratify, return_inverse=True)
n_classes = len(classes)
num_class = np.bincount(digitized)
A Member left a review comment on the lines above:

One problem with using bincount is that len(num_class) != n_classes in some cases, e.g., labels = [0, 1, 2, 4], in which 3 is missing.

@colinkyle (Contributor, Author) replied:

I lifted that logic directly from sklearn's implementation, and I believe it handles that case: digitized contains the labels renumbered from zero with no gaps (e.g., labels = [1, 2, 4, 4, 2, 1, 1, 1, 2, 2, 0, 0] gives digitized = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 0, 0]), so len(num_class) does equal n_classes.
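A minimal standalone check of this behaviour (plain numpy, not part of the PR diff):

```python
import numpy as np

labels = [1, 2, 4, 4, 2, 1, 1, 1, 2, 2, 0, 0]   # class 3 never occurs
classes, digitized = np.unique(labels, return_inverse=True)

print(classes)                 # [0 1 2 4]
print(digitized)               # [1 2 3 3 2 1 1 1 2 2 0 0] -- gap-free, starts at 0
print(np.bincount(digitized))  # [2 4 4 2], length 4 == len(classes)
```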

@colinkyle (Contributor, Author)

Is this going to move forward? I'm not sure where we are with review.

@leezu (Contributor) commented Oct 15, 2019

@szhengac

@szhengac (Member)

@colinkyle It will be merged once it passes the CI.

@colinkyle (Contributor, Author)

@szhengac thanks for the update.

@szhengac (Member)

@leezu the CI error is related to downloading a dataset. It seems we have seen it in other PRs before?

@sxjscience (Member)

I think we should just rebase and merge.

@mli (Member) commented Nov 6, 2019

Job PR-933/16 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/16/index.html

@szha added the "release focus" label Nov 7, 2019
@mli (Member) commented Jan 15, 2020

Job PR-933/17 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-933/17/index.html

@leezu merged commit 4d0ec7c into dmlc:master Jan 16, 2020