Introduce faster tokenizer for BERT #1024

cmdevries · 2019-11-27T10:55:27Z

Description

This change introduces a new tokenizer for BERT that is about 3.5x faster on a
2017 13-inch MacBook Pro.

It was tested by tokenizing the test string u"UNwant\u00E9d,running"
from test_transforms.py::bert_tokenizer 100,000 times using the timeit
module.

The existing implementation with the Cython-optimized wordpiece took
5.56 seconds and the new implementation took 1.58 seconds.
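
For anyone who wants to repeat the measurement, a minimal sketch of a comparable timeit harness follows. It is not the script used for the numbers above. `placeholder_tokenizer`, `time_tokenizer`, and `speedup` are hypothetical names introduced only for illustration, and the placeholder should be replaced with the tokenizer under test.

```python
import timeit

# Test string quoted in the PR description.
TEST_STRING = u"UNwant\u00E9d,running"


def placeholder_tokenizer(text):
    """Hypothetical stand-in; swap in the tokenizer being measured."""
    return text.lower().split(",")


def time_tokenizer(tokenizer, text=TEST_STRING, number=100000):
    """Return total seconds taken by `number` calls of tokenizer(text)."""
    return timeit.timeit(lambda: tokenizer(text), number=number)


def speedup(old_seconds, new_seconds):
    """Ratio of the old wall-clock time to the new one."""
    return old_seconds / new_seconds


if __name__ == "__main__":
    elapsed = time_tokenizer(placeholder_tokenizer)
    print("placeholder: %.2f s for 100,000 calls" % elapsed)
    # Timings reported in this PR description (5.56 s vs 1.58 s).
    print("reported speedup: %.2fx" % speedup(5.56, 1.58))
```

With the timings reported above, the ratio 5.56 / 1.58 comes to roughly 3.5, which matches the stated speedup. Absolute timings will differ across machines, so only the ratio is meaningful for comparison.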

The changes were originally authored by Eric Lind [email protected]
and this commit integrates them with Gluon NLP.

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

codecov · 2019-11-27T10:55:30Z

Codecov Report

Merging #1024 into master will increase coverage by 20.54%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #1024       +/-   ##
===========================================
+ Coverage   67.74%   88.29%   +20.54%     
===========================================
  Files          67       67               
  Lines        6340     6261       -79     
===========================================
+ Hits         4295     5528     +1233     
+ Misses       2045      733     -1312

Impacted Files	Coverage Δ
src/gluonnlp/data/transforms.py	`82.58% <100%> (+60.28%)`	⬆️
src/gluonnlp/data/word_embedding_evaluation.py	`96.94% <0%> (+0.76%)`	⬆️
src/gluonnlp/model/train/language_model.py	`96.21% <0%> (+1.62%)`	⬆️
src/gluonnlp/model/sequence_sampler.py	`35.88% <0%> (+3.13%)`	⬆️
src/gluonnlp/embedding/evaluation.py	`95.79% <0%> (+4.2%)`	⬆️
src/gluonnlp/model/transformer.py	`91.31% <0%> (+4.5%)`	⬆️
src/gluonnlp/model/bilm_encoder.py	`100% <0%> (+5.08%)`	⬆️
src/gluonnlp/model/convolutional_encoder.py	`97.67% <0%> (+6.97%)`	⬆️
src/gluonnlp/model/parameter.py	`100% <0%> (+8%)`	⬆️
src/gluonnlp/data/stream.py	`84.97% <0%> (+8.29%)`	⬆️
... and 33 more

mli · 2019-11-27T15:37:26Z

Job PR-1024/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/6/index.html

mli · 2019-11-27T15:57:28Z

Job PR-1024/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/7/index.html

mli · 2019-11-27T17:07:50Z

Job PR-1024/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/8/index.html

leezu

LGTM, thank you!

cmdevries requested a review from a team as a code owner November 27, 2019 10:55

cmdevries force-pushed the master branch from 7296761 to 77b16a5 Compare November 27, 2019 16:32

leezu approved these changes Nov 28, 2019

leezu merged commit bbe5375 into dmlc:master Nov 28, 2019

leezu mentioned this pull request Jan 15, 2020

[WIP] Fix BERT Japanese Tokenization #840

Closed

6 tasks
