This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

cmdevries
Contributor

@cmdevries cmdevries commented Nov 27, 2019

Description

This change introduces a new tokenizer for BERT that is 3.5x faster on a
2017 13-inch MacBook Pro.

It was tested by tokenizing the test string u"UNwant\u00E9d,running"
from test_transforms.py::bert_tokenizer 100,000 times using the timeit
module.

The existing implementation with the Cython-optimized wordpiece took
5.56 seconds and the new implementation took 1.58 seconds.

The changes were originally authored by Eric Lind [email protected]
and this commit integrates them with Gluon NLP.
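
For reference, a minimal sketch of the benchmark using timeit is shown below. The vocabulary setup via nlp.model.get_model is an assumption made here for illustration; the unit test in test_transforms.py builds its own fixture, and timings will vary by machine.

import timeit

import gluonnlp as nlp

# Obtain a BERT vocabulary (assumption: any uncased BERT vocab is fine for
# this micro-benchmark; pretrained weights are not needed).
_, vocab = nlp.model.get_model('bert_12_768_12',
                               dataset_name='book_corpus_wiki_en_uncased',
                               pretrained=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)

# Same input string as test_transforms.py::bert_tokenizer.
text = u"UNwant\u00E9d,running"
elapsed = timeit.timeit(lambda: tokenizer(text), number=100000)
print('100,000 tokenizations took %.2f seconds' % elapsed)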

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

@cmdevries cmdevries requested a review from a team as a code owner November 27, 2019 10:55
@codecov

codecov bot commented Nov 27, 2019

Codecov Report

Merging #1024 into master will increase coverage by 20.54%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #1024       +/-   ##
===========================================
+ Coverage   67.74%   88.29%   +20.54%     
===========================================
  Files          67       67               
  Lines        6340     6261       -79     
===========================================
+ Hits         4295     5528     +1233     
+ Misses       2045      733     -1312
Impacted Files Coverage Δ
src/gluonnlp/data/transforms.py 82.58% <100%> (+60.28%) ⬆️
src/gluonnlp/data/word_embedding_evaluation.py 96.94% <0%> (+0.76%) ⬆️
src/gluonnlp/model/train/language_model.py 96.21% <0%> (+1.62%) ⬆️
src/gluonnlp/model/sequence_sampler.py 35.88% <0%> (+3.13%) ⬆️
src/gluonnlp/embedding/evaluation.py 95.79% <0%> (+4.2%) ⬆️
src/gluonnlp/model/transformer.py 91.31% <0%> (+4.5%) ⬆️
src/gluonnlp/model/bilm_encoder.py 100% <0%> (+5.08%) ⬆️
src/gluonnlp/model/convolutional_encoder.py 97.67% <0%> (+6.97%) ⬆️
src/gluonnlp/model/parameter.py 100% <0%> (+8%) ⬆️
src/gluonnlp/data/stream.py 84.97% <0%> (+8.29%) ⬆️
... and 33 more

@mli
Member

mli commented Nov 27, 2019

Job PR-1024/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/6/index.html

@mli
Member

mli commented Nov 27, 2019

Job PR-1024/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/7/index.html

@mli
Member

mli commented Nov 27, 2019

Job PR-1024/8 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1024/8/index.html

Contributor

@leezu leezu left a comment


LGTM, thank you!

@leezu leezu merged commit bbe5375 into dmlc:master Nov 28, 2019
@leezu leezu mentioned this pull request Jan 15, 2020