Bug fix: Only append eod token once when packing / tokenizing #283
Conversation
…lag to tokenize function
…e a special token
mali-git
left a comment
LGTM! Minor comments.
flxst
left a comment
Great work! :)
Found some minor issues and left some comments.
tests/dataloader/test_end_to_end_indexation_and_tokenization.py
fromm-m
left a comment
LGTM
What does this PR do?
Some HF tokenizers, such as `xlm-roberta-large`, add special tokens (e.g., the eod token) automatically when encoding text, whereas others, such as `gpt2`, do not. This side effect in the transformers library has led to the eod token being appended twice when tokenizing / packing our data. We added a check for this and now append the eod token only once:
(see modalities/src/modalities/dataloader/create_packed_data.py, lines 327 to 330 in 1c1ccdc)
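The idea behind the check can be sketched as follows. This is a minimal illustration, not the actual implementation in `create_packed_data.py`; `FakeTokenizer` and `tokenize_with_single_eod` are hypothetical names, and the character-level "tokenization" is a toy stand-in for a real HF tokenizer.

```python
class FakeTokenizer:
    """Hypothetical stand-in for a HF tokenizer (illustration only).

    adds_eod=True mimics tokenizers like xlm-roberta-large that append
    special tokens automatically; adds_eod=False mimics gpt2.
    """

    def __init__(self, adds_eod: bool, eod_id: int):
        self.adds_eod = adds_eod
        self.eod_id = eod_id

    def __call__(self, text: str) -> dict:
        ids = [ord(c) for c in text]  # toy tokenization: one id per character
        if self.adds_eod:
            ids.append(self.eod_id)
        return {"input_ids": ids}


def tokenize_with_single_eod(tokenizer, text: str, eod_token_id: int) -> list:
    """Tokenize text and guarantee exactly one trailing eod token."""
    token_ids = tokenizer(text)["input_ids"]
    # Only append the eod token if the tokenizer did not already do so.
    if not token_ids or token_ids[-1] != eod_token_id:
        token_ids.append(eod_token_id)
    return token_ids
```

With this guard, both tokenizer behaviors yield the same packed sequence, ending in a single eod token.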
Additionally, we now enforce that the eod token is a special token.
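A minimal sketch of such an enforcement check, assuming a HF-style tokenizer that exposes `all_special_tokens` (the function name `check_eod_is_special` is hypothetical):

```python
def check_eod_is_special(tokenizer, eod_token: str) -> None:
    """Raise if eod_token is not registered as a special token.

    A non-special eod token could be split into sub-tokens during
    encoding, which would break the single-eod check above.
    """
    if eod_token not in tokenizer.all_special_tokens:
        raise ValueError(f"eod token {eod_token!r} must be a special token")
```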
I also added a script that verifies the consistency of the indexation and tokenization of a given JSONL file. We run the indexation and tokenization routines in modalities and compare the result to the tokenized JSONL file produced by applying the HF tokenizer directly.
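The consistency check described above can be sketched like this. It is a simplified illustration of the idea, not the actual script; `tokenize_modalities` and `tokenize_hf` are hypothetical callables standing in for the modalities pipeline and the raw HF tokenizer, and `text_key` assumes the JSONL records store their text under a `"text"` field.

```python
import json


def verify_tokenization_consistency(jsonl_path, tokenize_modalities, tokenize_hf,
                                    text_key="text"):
    """Compare the modalities tokenization against direct HF tokenization.

    Raises AssertionError on the first line whose token ids differ.
    """
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f):
            text = json.loads(line)[text_key]
            expected = tokenize_hf(text)        # HF tokenizer applied directly
            actual = tokenize_modalities(text)  # modalities pipeline output
            if actual != expected:
                raise AssertionError(
                    f"mismatch on line {line_no}: {actual} != {expected}"
                )
```

Running both routines over the same JSONL file and comparing line by line catches exactly the kind of duplicated-eod discrepancy this PR fixes.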
Checklist before submitting final PR
- Did you run the tests (`python tests/tests.py`)?
- Did you update the changelog (`CHANGELOG_DEV.md`)?