Conversation

@lhoestq (Member) commented Aug 20, 2020

I changed how cast_to_python_objects works to make it faster.
It is used to recursively cast numpy/pytorch/tensorflow/pandas objects to python lists.

To avoid iterating over possibly long lists, it first checks whether the first non-None element needs to be cast.
If the first element needs to be cast, then all the elements of the list are cast; otherwise they are left unchanged.
This trick makes it possible to cast objects that contain tokenizer outputs without iterating over every single token, for example. A minimal sketch of the idea follows.
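
As an illustration only (not the library's actual implementation), here is a minimal sketch of the first-element shortcut; the helper _needs_cast and the function name cast_to_python_objects_sketch are hypothetical, and only numpy objects are handled:

import numpy as np

def _needs_cast(obj):
    # Hypothetical helper: True if obj (or something inside it) may need casting.
    if isinstance(obj, (np.ndarray, np.generic)):
        return True
    if isinstance(obj, dict):
        return any(_needs_cast(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        # Only inspect the first non-None element, not the whole list.
        first = next((x for x in obj if x is not None), None)
        return first is not None and _needs_cast(first)
    return False

def cast_to_python_objects_sketch(obj):
    # Recursively cast numpy objects to plain python lists/scalars.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, dict):
        return {k: cast_to_python_objects_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # The shortcut: if the first non-None element does not need casting,
        # return the list as-is instead of iterating over every element
        # (e.g. every token of a tokenizer output).
        if not _needs_cast(obj):
            return obj
        return [cast_to_python_objects_sketch(x) for x in obj]
    return obj

With this shortcut, a batch of tokenizer outputs whose values are already python lists of ints is returned immediately instead of being rebuilt element by element.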

Speed improvement:

import transformers
import nlp

tok = transformers.BertTokenizerFast.from_pretrained("bert-base-uncased")
txt = ["a " * 512] * 1000
dataset = nlp.Dataset.from_dict({"txt": txt})

# Tokenization using .map is now faster. Previously it was taking 3.5s
%time _ = dataset.map(lambda x: tok(x["txt"]), batched=True, load_from_cache_file=False)
# 450ms

# for comparison
%time _ = tok(txt)
# 280ms

@lhoestq requested a review from @thomwolf on August 20, 2020 09:42
@thomwolf (Member) left a comment

Nice! Maybe we can add a few tests on these behaviors?

@lhoestq (Member, Author) commented Aug 20, 2020

I took your comments into account and added tests for cast_to_python_objects. A sketch of what such a test could look like follows.
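
For illustration only (the actual tests were added in this PR), a minimal sketch of a test covering the behaviors above; the import path nlp.arrow_writer is an assumption:

import numpy as np
from nlp.arrow_writer import cast_to_python_objects  # assumed import path

def test_cast_to_python_objects():
    # numpy arrays are cast to python lists
    assert cast_to_python_objects(np.array([1, 2, 3])) == [1, 2, 3]
    # lists of plain python objects are returned unchanged
    obj = [1, 2, None, 3]
    assert cast_to_python_objects(obj) == obj
    # nested structures are cast recursively
    assert cast_to_python_objects({"a": np.array([1, 2])}) == {"a": [1, 2]}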

@lhoestq force-pushed the speedup-tokenization-by-optimizing-cast_to_python_objects branch from 77fb7e5 to b002a15 on August 24, 2020 08:32
@lhoestq merged commit 12a32b9 into master on Aug 24, 2020
@lhoestq deleted the speedup-tokenization-by-optimizing-cast_to_python_objects branch on August 24, 2020 08:54