Skip to content

Conversation

@le1nux
Copy link
Member

@le1nux le1nux commented Dec 13, 2024

What does this PR do?

Some HF tokenisers such as xlm-roberta-large add special tokens (e.g., eod token) automatically when encoding text, whereas others, such as gpt2, do not add special tokens.

This side-effect in the transformers library has lead to the eod token being appended twice when tokenizing / packing our data. We added a check for this and only append the eod token once now:

if not token_byte_string.endswith(self._encoded_eos_token_as_bytes):
token_byte_string = token_byte_string + self._encoded_eos_token_as_bytes
return token_byte_string

Additionally, we now enforce now that the eod token is a special token.

Additionally, I added a script that verifies the consistency of the indexation and tokenization of a given JSONL file. We run the indexation and tokenization routines in modalities and compare it to tokenized JSONL file to which we applied the HF tokenizer directly.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested review from flxst and mali-git December 13, 2024 23:47
@le1nux le1nux added bug Something isn't working enhancement New feature or request labels Dec 13, 2024
@le1nux le1nux self-assigned this Dec 13, 2024
@le1nux le1nux added this to the v0.3.2 milestone Dec 13, 2024
Copy link
Member

@mali-git mali-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Minor comments.

Copy link
Member

@flxst flxst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! :)

Found some minor issues and left some comments.

@le1nux le1nux requested review from flxst and mali-git December 16, 2024 16:21
Copy link
Member

@fromm-m fromm-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@le1nux le1nux merged commit 13f1a26 into main Jan 13, 2025
3 checks passed
@le1nux le1nux deleted the fix_only_append_eod_token_once branch January 13, 2025 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants