Bug fix: Only append eod token once when packing / tokenizing #283
Conversation
…lag to tokenize function
…e a special token
mali-git
left a comment
LGTM! Minor comments.
flxst
left a comment
Great work! :)
Found some minor issues and left some comments.
tests/dataloader/test_end_to_end_indexation_and_tokenization.py
fromm-m
left a comment
LGTM
What does this PR do?
Some HF tokenizers, such as `xlm-roberta-large`, add special tokens (e.g., the eod token) automatically when encoding text, whereas others, such as `gpt2`, do not. This side effect in the transformers library has led to the eod token being appended twice when tokenizing / packing our data. We added a check for this and now append the eod token only once:
(see modalities/src/modalities/dataloader/create_packed_data.py, lines 327 to 330 in 1c1ccdc)
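The idea behind the check can be sketched as follows. This is a minimal illustration, not the actual implementation in `create_packed_data.py`; `FakeTokenizer` and `tokenize_with_single_eod` are hypothetical names, and the character-level "tokenization" is a toy stand-in for a real HF tokenizer.

```python
class FakeTokenizer:
    """Hypothetical stand-in for a HF tokenizer (illustration only).

    adds_eod=True mimics tokenizers like xlm-roberta-large that append
    special tokens automatically; adds_eod=False mimics gpt2.
    """

    def __init__(self, adds_eod: bool, eod_id: int):
        self.adds_eod = adds_eod
        self.eod_id = eod_id

    def __call__(self, text: str) -> dict:
        ids = [ord(c) for c in text]  # toy tokenization: one id per character
        if self.adds_eod:
            ids.append(self.eod_id)
        return {"input_ids": ids}


def tokenize_with_single_eod(tokenizer, text: str, eod_token_id: int) -> list:
    """Tokenize text and guarantee exactly one trailing eod token."""
    token_ids = tokenizer(text)["input_ids"]
    # Only append the eod token if the tokenizer did not already do so.
    if not token_ids or token_ids[-1] != eod_token_id:
        token_ids.append(eod_token_id)
    return token_ids
```

With this guard, both tokenizer behaviors yield the same packed sequence, ending in a single eod token.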
Additionally, we now enforce that the eod token is a special token.
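A minimal sketch of such an enforcement check, assuming a HF-style tokenizer that exposes `all_special_tokens` (the function name `check_eod_is_special` is hypothetical):

```python
def check_eod_is_special(tokenizer, eod_token: str) -> None:
    """Raise if eod_token is not registered as a special token.

    A non-special eod token could be split into sub-tokens during
    encoding, which would break the single-eod check above.
    """
    if eod_token not in tokenizer.all_special_tokens:
        raise ValueError(f"eod token {eod_token!r} must be a special token")
```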
I also added a script that verifies the consistency of the indexation and tokenization of a given JSONL file. We run the indexation and tokenization routines in modalities and compare the result to the tokenized JSONL file produced by applying the HF tokenizer directly.
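The consistency check described above can be sketched like this. It is a simplified illustration of the idea, not the actual script; `tokenize_modalities` and `tokenize_hf` are hypothetical callables standing in for the modalities pipeline and the raw HF tokenizer, and `text_key` assumes the JSONL records store their text under a `"text"` field.

```python
import json


def verify_tokenization_consistency(jsonl_path, tokenize_modalities, tokenize_hf,
                                    text_key="text"):
    """Compare the modalities tokenization against direct HF tokenization.

    Raises AssertionError on the first line whose token ids differ.
    """
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f):
            text = json.loads(line)[text_key]
            expected = tokenize_hf(text)        # HF tokenizer applied directly
            actual = tokenize_modalities(text)  # modalities pipeline output
            if actual != expected:
                raise AssertionError(
                    f"mismatch on line {line_no}: {actual} != {expected}"
                )
```

Running both routines over the same JSONL file and comparing line by line catches exactly the kind of duplicated-eod discrepancy this PR fixes.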
Checklist before submitting final PR
- Did you run the tests (`python tests/tests.py`)?
- Did you update the changelog (`CHANGELOG_DEV.md`)?