Description
Hi team, thanks for all the efforts on this model – it looks super promising and powerful and is a great contribution. Context on my side is that I've been using this as a learning exercise to get to grips with the latest model architectures (so this issue might have some built-in naiveté!)
I've prepped a dataset for my specific use case that I'd like to train Donut on – close to 10k images if I throw all of them at it. At that scale I noticed that tokenizing the gt_parse data is noticeably slow: it would cost me about 2 hours of GPU time just to do the tokenizing. I traced it back to this line in utils.py:
Line 77 in 4cfcf97
self.gt_token_sequences.append(
It calls a fairly expensive tokenizer method in transformers/tokenization_utils_base (https://github.com/huggingface/transformers/blob/c9e72f55b2dc4b9be4edb986dce0552582b328f2/src/transformers/tokenization_utils_base.py#L873) – about 50 ms per call, and in my case it gets called roughly 20 times per row of training data, i.e. over a second per row. 99.999% of the time the method reports that 0 special tokens were added; they were of course already added while processing an earlier row, but it costs us 50 ms each time to find that out.
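For anyone who wants to reproduce the overhead in isolation, here's a minimal standalone timing sketch (not the actual Donut code; the checkpoint name is just the one I believe Donut's decoder builds on, and any slow tokenizer should show the same pattern):

```python
# Standalone illustration of the repeated add_special_tokens overhead.
import time

from transformers import XLMRobertaTokenizer

# Example checkpoint only – swap in whichever tokenizer you are actually using.
tokenizer = XLMRobertaTokenizer.from_pretrained("hyunwoongko/asian-bart-ecjk")

tokens = {"additional_special_tokens": ["<s_example_key>", "</s_example_key>"]}

start = time.perf_counter()
for _ in range(20):  # roughly what one row of my training data costs
    added = tokenizer.add_special_tokens(tokens)  # returns 0 after the first call
elapsed = time.perf_counter() - start
print(f"20 calls took {elapsed:.2f}s, last call added {added} tokens")
```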
Since special tokens are not modified by the tokenizer (as I understand it), why do we need to go through the tokenizer just to test whether a given token still needs to be added? Genuine question – I'm conscious I may be missing some nuance here. Could we instead maintain a set of special tokens that have already been successfully tokenized and added in donut/utils, filter those out of future calls to the tokenizer, and, crucially, skip the call altogether when there are no new/unique special tokens for it to consider? A rough sketch of what I mean is below.
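A minimal sketch of the idea, assuming a module-level cache in donut/utils (the function name and the `_already_added` variable are hypothetical; in a real PR this would wrap the existing add_special_tokens path rather than live as a free function):

```python
from typing import Iterable

# Hypothetical cache of special tokens we have already pushed to the tokenizer.
_already_added: set = set()


def add_special_tokens_cached(tokenizer, tokens: Iterable[str]) -> int:
    """Forward only unseen tokens to the tokenizer; skip the expensive call
    entirely when every token has been registered before."""
    new_tokens = [t for t in tokens if t not in _already_added]
    if not new_tokens:
        return 0  # nothing new -> no ~50 ms tokenizer round-trip
    # Caveat: depending on the transformers version, add_special_tokens may
    # replace rather than extend additional_special_tokens, so the exact call
    # here would need checking against the version Donut pins.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": sorted(new_tokens)}
    )
    _already_added.update(new_tokens)
    return num_added
```

If the model resizes its embeddings whenever new tokens are added, that would still only happen for genuinely new tokens, so the behaviour should be unchanged – only the redundant calls disappear.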
Happy to create a PR if this sounds sensible and I'm not missing something special that the tokenizer call is doing for <token_x> the nth time over :)
Cheers!