This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

@zburning (Contributor) commented Feb 1, 2020

Description

Add a round_to feature so that the padded dimension is rounded up to a multiple of this argument.
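For illustration, a minimal usage sketch of the new keyword (values and the pad_val choice are made up; only the round_to behaviour follows this PR's description):

```python
import mxnet as mx
import gluonnlp as nlp

# Minimal sketch, assuming the new round_to keyword on the Pad batchify function.
# With round_to=8, the padded (sequence) axis is rounded up to a multiple of 8.
pad = nlp.data.batchify.Pad(pad_val=0, round_to=8)
batch = pad([mx.nd.array([1, 2, 3]), mx.nd.array([4, 5, 6, 7, 8])])
print(batch.shape)  # (2, 8): the longest sample has length 5, rounded up to 8
```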

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • round_to option for nlp.data.batchify.Pad, so the padded dimension is rounded up to a multiple of round_to

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zburning requested a review from a team as a code owner on February 1, 2020 04:42
codecov bot commented Feb 1, 2020

Codecov Report

Merging #1133 into master will decrease coverage by 0.99%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1133   +/-   ##
=======================================
- Coverage   88.39%   87.39%   -1%     
=======================================
  Files          71       71           
  Lines        6703     6703           
=======================================
- Hits         5925     5858   -67     
- Misses        778      845   +67
Impacted Files Coverage Δ
src/gluonnlp/data/batchify/embedding.py 45.16% <0%> (-52.42%) ⬇️
src/gluonnlp/vocab/subwords.py 85.1% <0%> (-2.13%) ⬇️

@zburning (Contributor, Author) commented Feb 1, 2020

Reference
5fe8d7b#r36969715
#1132

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Sorry for the late update. Does this round_to feature meet your needs?

@mli (Member) commented Feb 1, 2020

Job PR-1133/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/1/index.html

@sxjscience (Member) commented:

LGTM from my side. I think padding to a multiple of 8 makes vectorization easier and reduces the number of specialized kernels.

@TaoLv (Member) commented Feb 1, 2020

So if we want every batch to have the same sentence length of 128, we need to set round_to to 128, right? Also, please be more specific about whether it rounds up or down.

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Yes, you can set round_to to 128 if you set max_len <= 128. And it is rounding up; thank you for pointing that out, I will make the description clearer.

@TaoLv (Member) commented Feb 1, 2020

Thank you @zburning. Could you please also clarify what happens if the value of round_to, or the sentence length after rounding, is larger than max_len?

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv For example, if you have a sequence of length 130 and you set round_to = 128, it will be padded to 128*2 = 256. By setting max_len <= 128, you will not have a sequence longer than 128, so this will not happen. The code is here.
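In other words, the padded length is the smallest multiple of round_to that is at least the sequence length. A minimal sketch of that arithmetic (the helper name is made up for illustration):

```python
import math

def padded_length(seq_len, round_to):
    # Smallest multiple of round_to that is >= seq_len (rounding up).
    return int(math.ceil(seq_len / round_to) * round_to)

print(padded_length(130, 128))  # 256
print(padded_length(100, 8))    # 104
```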

@TaoLv (Member) commented Feb 1, 2020

So the final length can be larger than the max_len given on the command line? Did I make any mistake in the table below?

real max size   round_to   max_len   final len
100             8          128       104
100             80         128       160
100             128        128       128
100             200        128       200

@mli (Member) commented Feb 1, 2020

Job PR-1133/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/2/index.html

@zburning (Contributor, Author) commented Feb 1, 2020

@TaoLv Yes, you are right.
Currently, max_len is only used during data preprocessing, so sequences longer than 128 are truncated to 128. Padding is then done by the gluonnlp main API nlp.data.batchify.Pad(round_to=args.round_to, ...).
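For clarity, a sketch of those two steps (max_len, round_to, and the toy sequences are placeholders, not the script's actual arguments):

```python
import gluonnlp as nlp

max_len, round_to = 128, 128
tokens = [list(range(200)), list(range(50))]

# 1) Preprocessing truncates each sequence to max_len.
truncated = [seq[:max_len] for seq in tokens]

# 2) Batchify pads the truncated sequences; the padded axis is rounded up to a
#    multiple of round_to, so it can exceed max_len when round_to > max_len.
batchify_fn = nlp.data.batchify.Pad(pad_val=0, round_to=round_to)
batch = batchify_fn(truncated)
print(batch.shape)  # (2, 128)
```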

@TaoLv (Member) commented Feb 1, 2020

Got it. Thank you for the explanation. So even if max_len is set on the command line, the final length used in the computation may be larger than max_len if round_to is also set.

@zburning (Contributor, Author) commented Feb 1, 2020

Yes, it can be a bit confusing... A clearer way is to always set round_to=max_len. But introducing round_to gives more flexibility for other requirements.

@leezu (Contributor) left a comment


Please add a test case; then LGTM

@zburning (Contributor, Author) commented Feb 2, 2020

@leezu For batchify.Pad(), there are already test cases. Do you mean comparing the model outputs with round_to versus the outputs without round_to?
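For reference, a hedged sketch of what a dedicated round_to test could look like (this is not the repository's existing test code):

```python
import gluonnlp as nlp

def test_pad_round_to():
    # Longest sample has length 5; with round_to=8 the padded width should be 8,
    # and the extra positions should hold the pad value.
    pad = nlp.data.batchify.Pad(pad_val=0, round_to=8)
    batch = pad([[1, 2, 3], [4, 5, 6, 7, 8]])
    assert batch.shape == (2, 8)
    assert batch[0][3:].sum().asscalar() == 0
```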

@mli (Member) commented Feb 2, 2020

Job PR-1133/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/3/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/4/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/5/index.html

@mli (Member) commented Feb 2, 2020

Job PR-1133/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/6/index.html

@leezu added the "release focus" label Feb 3, 2020
@mli (Member) commented Feb 3, 2020

Job PR-1133/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/7/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/8/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/9/index.html

@mli (Member) commented Feb 3, 2020

Job PR-1133/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1133/10/index.html

@leezu merged commit 3a6a8f6 into dmlc:master Feb 3, 2020
Labels: release focus (Progress focus for release)
6 participants