Conversation

@lhoestq (Member) commented Aug 20, 2020

I changed how cast_to_python_objects works to make it faster.
It is used to recursively cast numpy/pytorch/tensorflow/pandas objects to python lists.

To avoid iterating over possibly long lists, it first checks whether the first non-None element needs to be cast.
If the first element needs to be cast, then all the elements of the list are cast; otherwise they are left unchanged.
This trick makes it possible to cast objects that contain tokenizer outputs without iterating over every single token, for example. A minimal sketch of the idea follows.
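
As an illustration only (not the library's actual implementation), here is a minimal sketch of the first-element shortcut; the helper _needs_cast and the function name cast_to_python_objects_sketch are hypothetical, and only numpy objects are handled:

import numpy as np

def _needs_cast(obj):
    # Hypothetical helper: True if obj (or something inside it) may need casting.
    if isinstance(obj, (np.ndarray, np.generic)):
        return True
    if isinstance(obj, dict):
        return any(_needs_cast(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        # Only inspect the first non-None element, not the whole list.
        first = next((x for x in obj if x is not None), None)
        return first is not None and _needs_cast(first)
    return False

def cast_to_python_objects_sketch(obj):
    # Recursively cast numpy objects to plain python lists/scalars.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, dict):
        return {k: cast_to_python_objects_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # The shortcut: if the first non-None element does not need casting,
        # return the list as-is instead of iterating over every element
        # (e.g. every token of a tokenizer output).
        if not _needs_cast(obj):
            return obj
        return [cast_to_python_objects_sketch(x) for x in obj]
    return obj

With this shortcut, a batch of tokenizer outputs whose values are already python lists of ints is returned immediately instead of being rebuilt element by element.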

Speed improvement:

import transformers
import nlp

tok = transformers.BertTokenizerFast.from_pretrained("bert-base-uncased")
txt = ["a " * 512] * 1000
dataset = nlp.Dataset.from_dict({"txt": txt})

# Tokenization using .map is now faster. Previously it was taking 3.5s
%time _ = dataset.map(lambda x: tok(x["txt"]), batched=True, load_from_cache_file=False)
# 450ms

# for comparison
%time _ = tok(txt)
# 280ms

@lhoestq requested a review from @thomwolf on August 20, 2020 09:42
@thomwolf (Member) left a comment

Nice! Maybe we can add a few tests on these behaviors?

@lhoestq (Member, Author) commented Aug 20, 2020

I took your comments into account and added tests for cast_to_python_objects. A sketch of what such a test could look like follows.
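
For illustration only (the actual tests were added in this PR), a minimal sketch of a test covering the behaviors above; the import path nlp.arrow_writer is an assumption:

import numpy as np
from nlp.arrow_writer import cast_to_python_objects  # assumed import path

def test_cast_to_python_objects():
    # numpy arrays are cast to python lists
    assert cast_to_python_objects(np.array([1, 2, 3])) == [1, 2, 3]
    # lists of plain python objects are returned unchanged
    obj = [1, 2, None, 3]
    assert cast_to_python_objects(obj) == obj
    # nested structures are cast recursively
    assert cast_to_python_objects({"a": np.array([1, 2])}) == {"a": [1, 2]}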

@lhoestq force-pushed the speedup-tokenization-by-optimizing-cast_to_python_objects branch from 77fb7e5 to b002a15 on August 24, 2020 08:32
@lhoestq merged commit 12a32b9 into master on Aug 24, 2020
@lhoestq deleted the speedup-tokenization-by-optimizing-cast_to_python_objects branch on August 24, 2020 08:54