This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

@zburning (Contributor) commented Feb 1, 2020

Description

Add a round_to feature so that the padded dimension is rounded up to a multiple of this argument.
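For illustration, a minimal usage sketch of the new keyword (values and the pad_val choice are made up; only the round_to behaviour follows this PR's description):

```python
import mxnet as mx
import gluonnlp as nlp

# Minimal sketch, assuming the new round_to keyword on the Pad batchify function.
# With round_to=8, the padded (sequence) axis is rounded up to a multiple of 8.
pad = nlp.data.batchify.Pad(pad_val=0, round_to=8)
batch = pad([mx.nd.array([1, 2, 3]), mx.nd.array([4, 5, 6, 7, 8])])
print(batch.shape)  # (2, 8): the longest sample has length 5, rounded up to 8
```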

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • round_to option for nlp.data.batchify.Pad, so the padded dimension is rounded up to a multiple of round_to

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zburning requested a review from a team as a code owner on February 1, 2020 04:42
codecov bot commented Feb 1, 2020

Codecov Report

Merging #1133 into master will decrease coverage by 0.99%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1133   +/-   ##
=======================================
- Coverage   88.39%   87.39%   -1%     
=======================================
  Files          71       71           
  Lines        6703     6703           
=======================================
- Hits         5925     5858   -67     
- Misses        778      845   +67
Impacted Files Coverage Δ
src/gluonnlp/data/batchify/embedding.py 45.16% <0%> (-52.42%) ⬇️
src/gluonnlp/vocab/subwords.py 85.1% <0%> (-2.13%) ⬇️

@zburning (Contributor, Author) commented Feb 1, 2020

Reference
5fe8d7b#r36969715
#1132

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Sorry for the late update. Does this round_to feature meet your needs?

@mli (Member) commented Feb 1, 2020

Job PR-1133/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/1/index.html

@sxjscience (Member) commented:

LGTM from my side. I think padding to a multiple of 8 makes vectorization easier and reduces the number of specialized kernels.

@TaoLv (Member) commented Feb 1, 2020

So if we want every batch to have the same sentence length of 128, we need to set round_to to 128, right? Also, please be more specific about whether it rounds up or down.

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Yes, you can set round_to to 128 if you set max_len <= 128. And it is rounding up; thank you for pointing that out, I will make the description clearer.

@TaoLv (Member) commented Feb 1, 2020

Thank you @zburning. Could you please also clarify what happens if the value of round_to, or the sentence length after rounding, is larger than max_len?

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv For example, if you have a sequence of length 130 and you set round_to = 128, it will be padded to 128*2 = 256. By setting max_len <= 128, you will not have a sequence longer than 128, so this will not happen. The code is here.
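In other words, the padded length is the smallest multiple of round_to that is at least the sequence length. A minimal sketch of that arithmetic (the helper name is made up for illustration):

```python
import math

def padded_length(seq_len, round_to):
    # Smallest multiple of round_to that is >= seq_len (rounding up).
    return int(math.ceil(seq_len / round_to) * round_to)

print(padded_length(130, 128))  # 256
print(padded_length(100, 8))    # 104
```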

@TaoLv (Member) commented Feb 1, 2020

So the final length can be larger than the max_len given on the command line? Did I make any mistake in the table below?

real max size   round_to   max_len   final len
100             8          128       104
100             80         128       160
100             128        128       128
100             200        128       200

@mli (Member) commented Feb 1, 2020

Job PR-1133/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/2/index.html

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Yes, you are right.
Currently, max_len is only used during data preprocessing, so sequences longer than 128 are truncated to 128. Padding is then done by the gluonnlp main API nlp.data.batchify.Pad(round_to=args.round_to, ...).
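For clarity, a sketch of those two steps (max_len, round_to, and the toy sequences are placeholders, not the script's actual arguments):

```python
import gluonnlp as nlp

max_len, round_to = 128, 128
tokens = [list(range(200)), list(range(50))]

# 1) Preprocessing truncates each sequence to max_len.
truncated = [seq[:max_len] for seq in tokens]

# 2) Batchify pads the truncated sequences; the padded axis is rounded up to a
#    multiple of round_to, so it can exceed max_len when round_to > max_len.
batchify_fn = nlp.data.batchify.Pad(pad_val=0, round_to=round_to)
batch = batchify_fn(truncated)
print(batch.shape)  # (2, 128)
```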

@TaoLv (Member) commented Feb 1, 2020

Got it. Thank you for the explanation. So even if max_len is set on the command line, the final length used in the computation may be larger than max_len if round_to is also set.

@zburning (Contributor, Author) commented Feb 1, 2020

Yes, it can be a bit confusing... A clearer way is to always set round_to=max_len. But introducing round_to gives more flexibility for other requirements.

@leezu (Contributor) left a comment


Please add a test case; then LGTM

@zburning (Contributor, Author) commented Feb 2, 2020

@leezu For batchify.Pad(), there are already test cases. Do you mean comparing the model outputs with round_to versus the outputs without round_to?
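For reference, a hedged sketch of what a dedicated round_to test could look like (this is not the repository's existing test code):

```python
import gluonnlp as nlp

def test_pad_round_to():
    # Longest sample has length 5; with round_to=8 the padded width should be 8,
    # and the extra positions should hold the pad value.
    pad = nlp.data.batchify.Pad(pad_val=0, round_to=8)
    batch = pad([[1, 2, 3], [4, 5, 6, 7, 8]])
    assert batch.shape == (2, 8)
    assert batch[0][3:].sum().asscalar() == 0
```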

@mli (Member) commented Feb 2, 2020

Job PR-1133/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/3/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/4/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/5/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/6/index.html

@leezu added the "release focus" label Feb 3, 2020
@mli (Member) commented Feb 3, 2020

Job PR-1133/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/7/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/8/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/9/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/10/index.html

@leezu merged commit 3a6a8f6 into dmlc:master Feb 3, 2020
Labels: release focus (Progress focus for release)
6 participants