This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

davisliang

Description

Cythonized BPE tokenization + LRU caching.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@davisliang davisliang requested a review from szha as a code owner September 7, 2019 05:12
@codecov

codecov bot commented Sep 7, 2019

Codecov Report

❗ No coverage uploaded for pull request head (master@1821e18). Click here to learn what that means.
The diff coverage is n/a.

@codecov

codecov bot commented Sep 7, 2019

Codecov Report

Merging #921 into master will increase coverage by 14.58%.
The diff coverage is 76.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #921       +/-   ##
==========================================
+ Coverage   75.21%   89.8%   +14.58%     
==========================================
  Files          67      67               
  Lines        6371    6356       -15     
==========================================
+ Hits         4792    5708      +916     
+ Misses       1579     648      -931
Impacted Files Coverage Δ
src/gluonnlp/data/transforms.py 84.82% <76.92%> (+2.82%) ⬆️
src/gluonnlp/model/train/language_model.py 96.21% <0%> (+1.62%) ⬆️
src/gluonnlp/data/utils.py 74.04% <0%> (+2.29%) ⬆️
src/gluonnlp/embedding/evaluation.py 95.79% <0%> (+4.2%) ⬆️
src/gluonnlp/model/bilm_encoder.py 100% <0%> (+5.08%) ⬆️
src/gluonnlp/model/convolutional_encoder.py 97.67% <0%> (+6.97%) ⬆️
src/gluonnlp/data/stream.py 84.97% <0%> (+8.29%) ⬆️
src/gluonnlp/model/utils.py 77.58% <0%> (+8.62%) ⬆️
src/gluonnlp/data/batchify/batchify.py 96.06% <0%> (+9.44%) ⬆️
... and 25 more

@@ -987,10 +990,11 @@ class BERTTokenizer:

_special_prefix = u'##'

-    def __init__(self, vocab, lower=True, max_input_chars_per_word=200):
+    def __init__(self, vocab, lower=True, max_input_chars_per_word=200, optimize=False):
Contributor

What's the use-case of optimize=False? Would it make sense to remove the argument and always optimize?

Contributor

If so, the LRU cache size should probably be configurable?

Member

Both Davis and I are not sure how to pass a variable max_size to @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE). Currently, the way people configure it is by setting nlp._constant.DEFAULT_CACHE_SIZE. @leezu any suggestions for a better way to configure it? ^^

Member

@davisliang I guess since we have the cython dependency already, we can always use optimize=True?

Author

Yes we can default to optimize=True. @leezu, can you advise on passing the max cache size variable on construction?

Contributor

The current solution applies the function decorator at class declaration time, thus maxsize can't be a constructor argument. One way to avoid this issue is to apply the function decorator only in __init__, i.e. adding the following expression

self._word_to_wordpiece_optimized = functools.lru_cache(maxsize=cache_maxsize)(self._word_to_wordpiece_optimized)

to __init__ and deleting line 1023 @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE).

With respect to making optimize=True the default, why not delete the non-optimized implementation, always optimize and delete the optimize argument?
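For illustration, here is a minimal sketch of that pattern with a configurable cache size (only _word_to_wordpiece_optimized is from this PR; the class and argument names are placeholders):

import functools

class WordpieceTokenizer:
    def __init__(self, cache_maxsize=512):
        # Wrap the bound method at construction time; each instance gets
        # its own LRU cache of the requested size, replacing the
        # class-level @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE).
        self._word_to_wordpiece_optimized = functools.lru_cache(
            maxsize=cache_maxsize)(self._word_to_wordpiece_optimized)

    def _word_to_wordpiece_optimized(self, word):
        # Placeholder for the real (cythonized) wordpiece lookup.
        return [word]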

@leezu
Contributor

leezu commented Sep 15, 2019

How does the latency compare to using SentencePiece? Can we drop BERTSPTokenizer?

@eric-haibin-lin
Member

We cannot drop BERTSPTokenizer, because we can learn a BPE model with sentencepiece on a custom corpus.

@leezu
Contributor

leezu commented Sep 16, 2019

Isn't the learned BPE tokenization fully specified via the vocabulary of BPE tokens? (We have a from_sentencepiece method to construct a vocabulary object, which we can use in that case to get the vocab of a custom sentencepiece model.) Does BERTTokenizer._tokenize_wordpiece differ from BERTSPTokenizer._tokenize_wordpiece?

@eric-haibin-lin
Member

Sorry for the late reply.

Isn't the learned BPE tokenization fully specified via the vocabulary of BPE tokens? (We have a from_sentencepiece method to construct a vocabulary object, which we can use in that case to get the vocab of a custom sentencepiece model.) Does BERTTokenizer._tokenize_wordpiece differ from BERTSPTokenizer._tokenize_wordpiece?

I think the main difference is that users can train a unigram sentencepiece model and perform sampling during tokenization (by setting alpha and num_best), which BERTTokenizer cannot do.
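For reference, a small sketch of that kind of subword sampling using the sentencepiece Python package directly, assuming a recent sentencepiece version (the corpus path, model prefix, and parameter values are placeholders; nbest_size/alpha correspond to BERTSPTokenizer's num_best/alpha):

import sentencepiece as spm

# Train a unigram model on a custom corpus (paths are placeholders).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='unigram', vocab_size=8000,
    model_type='unigram')

sp = spm.SentencePieceProcessor(model_file='unigram.model')

# Deterministic (best) segmentation.
print(sp.encode('hello world', out_type=str))

# Sampled segmentation; nbest_size=-1 samples over all candidates.
print(sp.sample_encode_as_pieces('hello world', -1, 0.1))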

@leezu
Contributor

leezu commented Sep 24, 2019

Then the functionality of BERTSPTokenizer is a superset of BERTTokenizer. Is the only reason for BERTTokenizer to exist to help people who can't install the sentencepiece dependency? (Are there any such users?) If so, I would suggest merging BERTSPTokenizer and BERTTokenizer into BERTTokenizer: if sentencepiece is available, BERTTokenizer could call sentencepiece; otherwise it can call the cython implementation. If a user requests sampling and sentencepiece is not installed, raise an error. What do you think?

@davisliang is BERTTokenizer faster than BERTSPTokenizer?
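A rough sketch of that fallback dispatch, purely illustrative (none of these names or signatures are from the GluonNLP codebase):

class MergedBERTTokenizer:
    def __init__(self, vocab, sp_model=None, num_best=0, alpha=1.0):
        try:
            import sentencepiece as spm  # optional dependency
            self._sp = (spm.SentencePieceProcessor(model_file=sp_model)
                        if sp_model else None)
        except ImportError:
            self._sp = None
        if self._sp is None and num_best:
            # Sampling is only available through sentencepiece.
            raise RuntimeError('subword sampling requires sentencepiece')
        self._vocab, self._num_best, self._alpha = vocab, num_best, alpha

    def __call__(self, text):
        if self._sp is not None:
            if self._num_best:
                return self._sp.sample_encode_as_pieces(
                    text, self._num_best, self._alpha)
            return self._sp.encode(text, out_type=str)
        # Fall back to the cython-backed wordpiece path.
        return self._wordpiece_tokenize(text)

    def _wordpiece_tokenize(self, text):
        return text.split()  # stand-in for the real wordpiece implementation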

@leezu
Contributor

leezu commented Sep 26, 2019

Actually, the functionality of BERTSPTokenizer is distinct from BERTTokenizer: one denotes whitespace explicitly (SentencePiece's ▁ prefix) and the other denotes "missing" whitespace as ##. So the class hierarchy in GluonNLP (BERTSPTokenizer being a subclass of BERTTokenizer) is semantically wrong. Maybe we can improve it by separating out the common component for BERT and keeping the wordpiece and sentencepiece tokenization separate, so it can be reused by other models.
(That's a separate effort, not necessarily in this PR.)

@davisliang davisliang requested a review from a team as a code owner October 23, 2019 00:32
@leezu
Contributor

leezu commented Oct 23, 2019

@davisliang I rebased your commit on current master and added more commits to address the review. Hope that's fine with you.

@dmlc/gluon-nlp-team please help to review

@mli
Member

mli commented Oct 23, 2019

Job PR-921/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-921/7/index.html

@leezu leezu requested a review from eric-haibin-lin October 23, 2019 20:40
@leezu leezu added the "release focus" (Progress focus for release) label Oct 25, 2019
@leezu
Contributor

leezu commented Oct 29, 2019

Ping for review @dmlc/gluon-nlp-reviewers

@davisliang
Author

Thanks for the great supplementary work, @leezu!

@leezu leezu merged commit 1704ab8 into dmlc:master Nov 6, 2019