This repository was archived by the owner on Jan 15, 2024. It is now read-only.

Conversation

davisliang

Description

Cythonized BPE tokenization + LRU caching.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@davisliang davisliang requested a review from szha as a code owner September 7, 2019 05:12
@codecov

codecov bot commented Sep 7, 2019

Codecov Report

❗ No coverage uploaded for pull request head (master@1821e18). Click here to learn what that means.
The diff coverage is n/a.

@codecov

codecov bot commented Sep 7, 2019

Codecov Report

Merging #921 into master will increase coverage by 14.58%.
The diff coverage is 76.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #921       +/-   ##
==========================================
+ Coverage   75.21%   89.8%   +14.58%     
==========================================
  Files          67      67               
  Lines        6371    6356       -15     
==========================================
+ Hits         4792    5708      +916     
+ Misses       1579     648      -931
Impacted Files Coverage Δ
src/gluonnlp/data/transforms.py 84.82% <76.92%> (+2.82%) ⬆️
src/gluonnlp/model/train/language_model.py 96.21% <0%> (+1.62%) ⬆️
src/gluonnlp/data/utils.py 74.04% <0%> (+2.29%) ⬆️
src/gluonnlp/embedding/evaluation.py 95.79% <0%> (+4.2%) ⬆️
src/gluonnlp/model/bilm_encoder.py 100% <0%> (+5.08%) ⬆️
src/gluonnlp/model/convolutional_encoder.py 97.67% <0%> (+6.97%) ⬆️
src/gluonnlp/data/stream.py 84.97% <0%> (+8.29%) ⬆️
src/gluonnlp/model/utils.py 77.58% <0%> (+8.62%) ⬆️
src/gluonnlp/data/batchify/batchify.py 96.06% <0%> (+9.44%) ⬆️
... and 25 more

@@ -987,10 +990,11 @@ class BERTTokenizer:

_special_prefix = u'##'

-    def __init__(self, vocab, lower=True, max_input_chars_per_word=200):
+    def __init__(self, vocab, lower=True, max_input_chars_per_word=200, optimize=False):
Contributor

What's the use-case of optimize=False? Would it make sense to remove the argument and always optimize?

Contributor

If so, the LRU cache size should probably be configurable?

Member

Both Davis and I are not sure how to pass a variable max_size to @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE). Currently, the way people configure it is by setting nlp._constant.DEFAULT_CACHE_SIZE. @leezu any suggestions for a better way to configure it? ^^

Member

@davisliang I guess since we have the cython dependency already, we can always use optimize=True?

Author

Yes we can default to optimize=True. @leezu, can you advise on passing the max cache size variable on construction?

Contributor

The current solution applies the function decorator at class declaration time, thus maxsize can't be a constructor argument. One way to avoid this issue is to apply the function decorator only in __init__, i.e. adding the following expression

self._word_to_wordpiece_optimized = functools.lru_cache(maxsize=cache_maxsize)(self._word_to_wordpiece_optimized)

to __init__ and deleting line 1023 @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE).

With respect to making optimize=True the default, why not delete the non-optimized implementation, always optimize and delete the optimize argument?
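For illustration, here is a minimal sketch of that pattern with a configurable cache size (only _word_to_wordpiece_optimized is from this PR; the class and argument names are placeholders):

import functools

class WordpieceTokenizer:
    def __init__(self, cache_maxsize=512):
        # Wrap the bound method at construction time; each instance gets
        # its own LRU cache of the requested size, replacing the
        # class-level @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE).
        self._word_to_wordpiece_optimized = functools.lru_cache(
            maxsize=cache_maxsize)(self._word_to_wordpiece_optimized)

    def _word_to_wordpiece_optimized(self, word):
        # Placeholder for the real (cythonized) wordpiece lookup.
        return [word]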

@leezu
Contributor

leezu commented Sep 15, 2019

How does the latency compare to using SentencePiece? Can we drop BERTSPTokenizer?

@eric-haibin-lin
Member

We cannot drop BERTSPTokenizer, because we can learn a BPE model with sentencepiece on a custom corpus.

@leezu
Contributor

leezu commented Sep 16, 2019

Isn't the learned BPE tokenization fully specified via the vocabulary of BPE tokens? (We have a from_sentencepiece method to construct a vocabulary object, which we can use in that case to get the vocab of a custom sentencepiece model.) Does BERTTokenizer._tokenize_wordpiece differ from BERTSPTokenizer._tokenize_wordpiece?

@eric-haibin-lin
Member

Sorry for the late reply.

Isn't the learned BPE tokenization fully specified via the vocabulary of BPE tokens? (We have a from_sentencepiece method to construct a vocabulary object, which we can use in that case to get the vocab of a custom sentencepiece model.) Does BERTTokenizer._tokenize_wordpiece differ from BERTSPTokenizer._tokenize_wordpiece?

I think the main difference is that users can train a unigram sentencepiece model and perform sampling during tokenization (by setting alpha and num_best), which BERTTokenizer cannot do.
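For reference, a small sketch of that kind of subword sampling using the sentencepiece Python package directly, assuming a recent sentencepiece version (the corpus path, model prefix, and parameter values are placeholders; nbest_size/alpha correspond to BERTSPTokenizer's num_best/alpha):

import sentencepiece as spm

# Train a unigram model on a custom corpus (paths are placeholders).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='unigram', vocab_size=8000,
    model_type='unigram')

sp = spm.SentencePieceProcessor(model_file='unigram.model')

# Deterministic (best) segmentation.
print(sp.encode('hello world', out_type=str))

# Sampled segmentation; nbest_size=-1 samples over all candidates.
print(sp.sample_encode_as_pieces('hello world', -1, 0.1))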

@leezu
Contributor

leezu commented Sep 24, 2019

Then the functionality of BERTSPTokenizer is a superset of BERTTokenizer. Is the only reason for BERTTokenizer to exist to help people who can't install the sentencepiece dependency? (Are there any such users?) If so, I would suggest merging BERTSPTokenizer and BERTTokenizer into BERTTokenizer: if sentencepiece is available, BERTTokenizer could call sentencepiece; otherwise it can call the cython implementation. If a user requests sampling and sentencepiece is not installed, raise an error. What do you think?

@davisliang is BERTTokenizer faster than BERTSPTokenizer?
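A rough sketch of that fallback dispatch, purely illustrative (none of these names or signatures are from the GluonNLP codebase):

class MergedBERTTokenizer:
    def __init__(self, vocab, sp_model=None, num_best=0, alpha=1.0):
        try:
            import sentencepiece as spm  # optional dependency
            self._sp = (spm.SentencePieceProcessor(model_file=sp_model)
                        if sp_model else None)
        except ImportError:
            self._sp = None
        if self._sp is None and num_best:
            # Sampling is only available through sentencepiece.
            raise RuntimeError('subword sampling requires sentencepiece')
        self._vocab, self._num_best, self._alpha = vocab, num_best, alpha

    def __call__(self, text):
        if self._sp is not None:
            if self._num_best:
                return self._sp.sample_encode_as_pieces(
                    text, self._num_best, self._alpha)
            return self._sp.encode(text, out_type=str)
        # Fall back to the cython-backed wordpiece path.
        return self._wordpiece_tokenize(text)

    def _wordpiece_tokenize(self, text):
        return text.split()  # stand-in for the real wordpiece implementation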

@leezu
Contributor

leezu commented Sep 26, 2019

Actually, the functionality of BERTSPTokenizer is distinct from BERTTokenizer: one denotes whitespace explicitly (SentencePiece's ▁ prefix) and the other denotes "missing" whitespace as ##. So the class hierarchy in GluonNLP (BERTSPTokenizer being a subclass of BERTTokenizer) is semantically wrong. Maybe we can improve it by separating out the common component for BERT and keeping the wordpiece and sentencepiece tokenization separate, so it can be reused by other models.
(That's a separate effort, not necessarily in this PR.)

@davisliang davisliang requested a review from a team as a code owner October 23, 2019 00:32
@leezu
Contributor

leezu commented Oct 23, 2019

@davisliang I rebased your commit on current master and added more commits to address the review. Hope that's fine with you.

@dmlc/gluon-nlp-team please help to review

@mli
Member

mli commented Oct 23, 2019

Job PR-921/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-921/7/index.html

@leezu leezu requested a review from eric-haibin-lin October 23, 2019 20:40
@leezu leezu added the "release focus" (Progress focus for release) label Oct 25, 2019
@leezu
Contributor

leezu commented Oct 29, 2019

Ping for review @dmlc/gluon-nlp-reviewers

@davisliang
Author

Thanks for the great supplementary work, @leezu!

@leezu leezu merged commit 1704ab8 into dmlc:master Nov 6, 2019