@maxreciprocate maxreciprocate commented Apr 14, 2023

This PR fixes a bug in which the tokenizer's left truncation could produce negative slice indices.

https://wandb.ai/sorry/trlx/runs/yn59xu9i
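
For readers unfamiliar with this class of bug, here is a minimal sketch of one way a truncation budget can turn into a negative index. It is not the code changed in this PR: `truncate_naive` and `truncate_clamped` are hypothetical helpers, messages are plain lists of token ids, and the truncation side is ignored for simplicity. The point is only that a negative slice bound in Python means "all but the last k items" rather than "nothing".

```python
from typing import List


def truncate_naive(messages: List[List[int]], max_length: int) -> List[List[int]]:
    """Hypothetical helper: spend a token budget across messages, unguarded."""
    out, used = [], 0
    for tokens in messages:
        # Once used > max_length, this bound is negative, and the slice
        # keeps everything except the last few tokens instead of nothing.
        out.append(tokens[: max_length - used])
        used += len(tokens)
    return out


def truncate_clamped(messages: List[List[int]], max_length: int) -> List[List[int]]:
    """Same idea, with the remaining budget clamped at zero."""
    out, used = [], 0
    for tokens in messages:
        out.append(tokens[: max(max_length - used, 0)])
        used += len(tokens)
    return out


messages = [[1] * 6, [2] * 5, [3] * 4]
naive = truncate_naive(messages, max_length=4)
clamped = truncate_clamped(messages, max_length=4)

print([len(m) for m in naive])    # [4, 3, 0] -> 7 tokens, over the budget of 4
print([len(m) for m in clamped])  # [4, 0, 0] -> exactly 4 tokens
```

In the naive version the second message is sliced with `[:-2]`, so three tokens leak past the budget; clamping the bound at zero drops them as intended.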

        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def test_tokenize_dialogue_truncation(self):
        dialogue = ["will be truncated", "ø" * 1024]

@LouisCastricato LouisCastricato Apr 15, 2023

Should probably use something that is ASCII rather than Unicode.

@cat-state cat-state left a comment

I saw one potential edge case, but LGTM otherwise.

         bos_token = tokenizer.bos_token or tokenizer.eos_token
         dialogue = [bos_token, dialogue]
-    elif isinstance(dialogue, tuple):
+    elif isinstance(dialogue, Iterable):

You could update the type in the signature too
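
A rough sketch of what a matching signature might look like, assuming the function accepts either a single string or any iterable of strings. `normalize_dialogue` and the `"<bos>"` default are hypothetical stand-ins for illustration, not the library's actual function or token.

```python
from typing import Iterable, List, Union


def normalize_dialogue(
    dialogue: Union[str, Iterable[str]], bos_token: str = "<bos>"
) -> List[str]:
    """Hypothetical helper: coerce the input into a list of phrases."""
    # str is itself an Iterable, so the str branch has to come first.
    if isinstance(dialogue, str):
        return [bos_token, dialogue]
    if isinstance(dialogue, Iterable):
        # Materialize the input: generators are Iterable but have no len().
        return list(dialogue)
    raise TypeError(f"unsupported dialogue type: {type(dialogue).__name__}")


print(normalize_dialogue("hello"))                       # ['<bos>', 'hello']
print(normalize_dialogue(("prompt", "output")))          # ['prompt', 'output']
print(normalize_dialogue(p for p in ["prompt", "out"]))  # ['prompt', 'out']
```

Widening the check from `tuple` to `Iterable` means lists and generators are accepted as well, which is why the annotation would change along with it.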

@maxreciprocate maxreciprocate merged commit 9bc0836 into main Apr 17, 2023
@maxreciprocate maxreciprocate deleted the fix-ilql-negative-indexing branch April 17, 2023 23:33