
json2token performance #266

@benjaminfh


Hi team, thanks for all the efforts on this model – it looks super promising and powerful and is a great contribution. Context on my side: I've been using this as a learning exercise to get to grips with the latest model architectures (aka, this issue might have some built-in naiveté!)

I've prepped a dataset for my specific use case that I'd like to train Donut on. I have close to 10k images to throw at it if I wish. When playing with a training set at this scale, I noticed that tokenizing the gt_parse data is noticeably slow – it would cost me about 2 hours of GPU time just to do the tokenizing. I traced it back to this line in utils.py:

self.gt_token_sequences.append(

It calls a fairly expensive tokenizer method in transformers/tokenization_utils_base (https://github.com/huggingface/transformers/blob/c9e72f55b2dc4b9be4edb986dce0552582b328f2/src/transformers/tokenization_utils_base.py#L873) – about 50 ms per call, and for me about 20 calls per row of training data, i.e. over a second per row! 99.999% of the time, this method reports that 0 special tokens were added. They were, of course, already added from a previous row of data, but it still takes 50 ms to determine this.
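A quick back-of-envelope check of those figures (all three numbers are the rough estimates from above, not measured constants):

```python
# Rough cost of the redundant tokenizer calls, using the figures above.
calls_per_row = 20       # ~20 special-token checks per training row
seconds_per_call = 0.05  # ~50 ms per add_special_tokens call
rows = 10_000            # ~10k images in the training set

total_hours = calls_per_row * seconds_per_call * rows / 3600
print(round(total_hours, 1))  # roughly 2.8 hours of pure tokenizer overhead
```

which lines up with the ~2 hours of GPU time observed.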

Since special tokens are not modified by the tokenizer (that's my understanding), why do we need to use the tokenizer to test whether a given token should be added? Genuine question – conscious of my own potential ignorance of nuance here. Could we instead maintain a set of special tokens that have already been successfully tokenized and added in donut/utils, remove those from future calls to the tokenizer, and, crucially, skip the call altogether when there are no new/unique special tokens for it to consider?
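A minimal sketch of what I have in mind (names are illustrative, not Donut's actual API; a stub stands in for the real transformers tokenizer so the call count is visible):

```python
class _StubTokenizer:
    """Stand-in for a transformers tokenizer, just to count calls."""

    def __init__(self):
        self.calls = 0
        self.special = set()

    def add_special_tokens(self, mapping):
        self.calls += 1  # this is the ~50 ms call in the real tokenizer
        new = [t for t in mapping["additional_special_tokens"]
               if t not in self.special]
        self.special.update(new)
        return len(new)


class SpecialTokenCache:
    """Remember which special tokens were already registered, so the
    expensive tokenizer call is skipped when a row adds nothing new."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self._known = set()  # tokens registered on an earlier row

    def add_tokens(self, tokens):
        new = [t for t in tokens if t not in self._known]
        if not new:
            return 0  # skip the tokenizer call entirely
        self._known.update(new)
        return self.tokenizer.add_special_tokens(
            {"additional_special_tokens": new})


tok = _StubTokenizer()
cache = SpecialTokenCache(tok)
cache.add_tokens(["<item>", "<price>"])  # first row: one real call
cache.add_tokens(["<item>", "<price>"])  # later rows: no call at all
```

With 10k rows all sharing the same vocabulary of special tokens, the real tokenizer would be hit only a handful of times instead of ~200k.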

Happy to create a PR if this sounds sensible / I'm not missing something special that the tokenizer call is doing for <token_x> the nth time over :)

Cheers!
