Speedy BERTTokenizer (now with caching!) #921
Conversation
Codecov Report
@@ Coverage Diff @@
## master #921 +/- ##
==========================================
+ Coverage 75.21% 89.8% +14.58%
==========================================
Files 67 67
Lines 6371 6356 -15
==========================================
+ Hits 4792 5708 +916
+ Misses 1579 648 -931
src/gluonnlp/data/transforms.py
@@ -987,10 +990,11 @@ class BERTTokenizer:

     _special_prefix = u'##'

-    def __init__(self, vocab, lower=True, max_input_chars_per_word=200):
+    def __init__(self, vocab, lower=True, max_input_chars_per_word=200, optimize=False):
What's the use-case of `optimize=False`? Would it make sense to remove the argument and always optimize?
If so, the LRU cache size should probably be configurable?
Both Davis and I are not sure how to pass a variable max_size to `@functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE)`. Currently the way people configure it is by setting `nlp._constant.DEFAULT_CACHE_SIZE`. @leezu any suggestion for better ways to configure it? ^^
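For illustration, a minimal sketch of why the class-level decorator pins the cache size at import time (the class and method names below are hypothetical; only `functools.lru_cache` and the module-constant idea come from the discussion above):

```python
import functools

DEFAULT_CACHE_SIZE = 256  # stand-in for nlp._constant.DEFAULT_CACHE_SIZE

class Tokenizer:
    # The decorator is evaluated when the class body runs, so maxsize is fixed
    # at import time and cannot be chosen per instance via __init__ arguments.
    @functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE)
    def _word_to_wordpiece(self, word):
        return (word,)  # placeholder for the real wordpiece lookup
```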
@davisliang I guess since we already have the cython dependency, we can always use `optimize=True`?
Yes, we can default to `optimize=True`. @leezu, can you advise on how to pass the max cache size at construction time?
The current solution applies the function decorator at class declaration time, so maxsize can't be a constructor argument. One way to avoid this issue is to apply the function decorator only in `__init__`, i.e. adding the expression
`self._word_to_wordpiece_optimized = functools.lru_cache(maxsize=cache_maxsize)(self._word_to_wordpiece_optimized)`
to `__init__` and deleting the `@functools.lru_cache(maxsize=DEFAULT_CACHE_SIZE)` decorator on line 1023.
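A minimal sketch of that suggestion, with illustrative names (only `functools.lru_cache` is the real API; the tokenizer internals are stand-ins):

```python
import functools

class Tokenizer:
    def __init__(self, cache_maxsize=512):
        # Wrap the bound method at construction time, so the cache size is an
        # ordinary constructor argument and each instance owns its own cache.
        self._word_to_wordpiece = functools.lru_cache(maxsize=cache_maxsize)(
            self._word_to_wordpiece)

    def _word_to_wordpiece(self, word):
        return (word,)  # placeholder for the real wordpiece lookup
```

Unlike the class-level decorator, this gives each tokenizer instance its own cache instead of a single cache shared across instances and keyed on `self`.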
With respect to making `optimize=True` the default: why not delete the non-optimized implementation, always optimize, and delete the `optimize` argument?
How does the latency compare to using SentencePiece? Can we drop BERTSPTokenizer?
We cannot drop BERTSPTokenizer because we can learn a BPE with SentencePiece on a custom corpus.
Isn't the learned BPE tokenization fully specified via the vocabulary of BPE tokens? (We have a
Sorry for the late reply.
I think the main difference is that users can train a unigram SentencePiece model and perform sampling during tokenization (by setting alpha and num_best), which BERTTokenizer cannot do.
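For context, a hedged sketch of the sampling behaviour referred to here, using the sentencepiece package directly (`unigram.model` is a hypothetical path to a trained unigram model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('unigram.model')  # hypothetical path to a trained unigram model

# With nbest_size=-1 the segmentation is sampled from the full lattice and
# alpha smooths the sampling distribution, so repeated calls may differ.
for _ in range(3):
    print(sp.SampleEncodeAsPieces('new york is big', -1, 0.1))
```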
Then the functionality of @davisliang is
Actually the functionality of
@davisliang I rebased your commit on current master and added more commits to address the review. Hope that's fine with you. @dmlc/gluon-nlp-team please help to review
Job PR-921/7 is complete.
Ping for review @dmlc/gluon-nlp-reviewers
Thanks for the great supplementary work, @leezu!
Description
Cythonized BPE tokenization + LRU caching.
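A hedged usage sketch of the tokenizer this PR touches (the constructor arguments follow the diff above and may change after review; fetching the pretrained vocabulary requires network access):

```python
import gluonnlp as nlp

# Download a pretrained BERT model only to obtain its wordpiece vocabulary.
_, vocab = nlp.model.get_model('bert_12_768_12',
                               dataset_name='book_corpus_wiki_en_uncased',
                               pretrained=True, use_pooler=False,
                               use_decoder=False, use_classifier=False)

tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
print(tokenizer('speedy bert tokenization with caching'))
```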
Checklist
Essentials
Changes
Comments