
Conversation

ved1beta (Contributor) commented Apr 7, 2025

Problem

The add_special_tokens=False parameter of the tokenizer's encode/encode_batch methods doesn't work as expected: even when it is set to False, special tokens (such as EOS) are still added to the encoded output.

Root Cause

The issue occurs because the add_special_tokens parameter is not forwarded to the base tokenizer's encode_batch method. Our code handles the parameter only after encoding, but by that point the base tokenizer has already appended the special tokens.
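A minimal reproduction of this failure mode, using hypothetical stand-in classes (BaseTokenizer, Wrapper, and a toy EOS id) rather than the real tokenizer API:

```python
import inspect

# Hypothetical stand-ins for illustration only; the real tokenizer's
# API may differ. The EOS id is 2 here.
class BaseTokenizer:
    EOS = 2

    def encode_batch(self, inputs, add_special_tokens=True):
        # Toy encoding: one id (the word length) per whitespace token.
        encoded = [[len(tok) for tok in text.split()] for text in inputs]
        if add_special_tokens:
            encoded = [ids + [self.EOS] for ids in encoded]
        return encoded

class Wrapper:
    def __init__(self):
        self.base_tokenizer = BaseTokenizer()

    def encode_batch(self, inputs, add_special_tokens=True):
        # Bug: the flag is never forwarded, so the base tokenizer has
        # already appended EOS by the time we could strip it.
        return self.base_tokenizer.encode_batch(inputs)

w = Wrapper()
print(w.encode_batch(["hi there"], add_special_tokens=False))  # [[2, 5, 2]] -- EOS still present
```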

Solution

This PR checks whether the base tokenizer's encode_batch method accepts the add_special_tokens parameter and, if it does, passes it through so the base tokenizer does not add special tokens. Tokenizers that don't accept the parameter keep their existing behavior, which preserves backward compatibility.
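The check can be sketched with inspect.signature as below; the helper name encode_batch_compat and the dummy tokenizer classes are illustrative assumptions, not the real implementation:

```python
import inspect

def encode_batch_compat(base_tokenizer, inputs, add_special_tokens=True):
    # Forward add_special_tokens only when the base tokenizer accepts it;
    # older tokenizers without the parameter keep their old behavior.
    params = inspect.signature(base_tokenizer.encode_batch).parameters
    if "add_special_tokens" in params:
        return base_tokenizer.encode_batch(inputs, add_special_tokens=add_special_tokens)
    return base_tokenizer.encode_batch(inputs)

# Dummy tokenizers exercising both paths (toy EOS id is 9).
class NewTok:
    def encode_batch(self, inputs, add_special_tokens=True):
        return [[1] + ([9] if add_special_tokens else []) for _ in inputs]

class OldTok:
    def encode_batch(self, inputs):
        return [[1, 9] for _ in inputs]

print(encode_batch_compat(NewTok(), ["a"], add_special_tokens=False))  # [[1]]
print(encode_batch_compat(OldTok(), ["a"], add_special_tokens=False))  # [[1, 9]] (parameter unsupported)
```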

Testing

I've verified the fix by testing encodings with and without special tokens:

  • add_special_tokens=False now correctly returns only content tokens without the EOS token
  • add_special_tokens=True continues to work as before, returning content tokens plus the EOS token
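The two cases above can be checked end to end with a small sketch; BaseTok, FixedWrapper, and the EOS id are assumptions for illustration, not the actual implementation:

```python
import inspect

EOS_ID = 2  # toy EOS id for this sketch

class BaseTok:
    def encode_batch(self, inputs, add_special_tokens=True):
        ids = [[len(t) for t in s.split()] for s in inputs]
        return [x + [EOS_ID] for x in ids] if add_special_tokens else ids

class FixedWrapper:
    def __init__(self):
        self.base_tokenizer = BaseTok()

    def encode_batch(self, inputs, add_special_tokens=True):
        # With the fix: pass the flag through when the base supports it.
        params = inspect.signature(self.base_tokenizer.encode_batch).parameters
        if "add_special_tokens" in params:
            return self.base_tokenizer.encode_batch(
                inputs, add_special_tokens=add_special_tokens)
        return self.base_tokenizer.encode_batch(inputs)

w = FixedWrapper()
assert w.encode_batch(["a bc"], add_special_tokens=False) == [[1, 2]]          # content only
assert w.encode_batch(["a bc"], add_special_tokens=True) == [[1, 2, EOS_ID]]   # content + EOS
```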

Fixes #765

import inspect

# Check if the base tokenizer's encode_batch method supports the
# add_special_tokens parameter before passing it through.
if 'add_special_tokens' in inspect.signature(self.base_tokenizer.encode_batch).parameters:
    batch_encoding = self.base_tokenizer.encode_batch(inputs, add_special_tokens=False)
else:
    batch_encoding = self.base_tokenizer.encode_batch(inputs)
Member

This hard-codes add_special_tokens to False. Did you mean to pass it through from the arguments?



Development

Successfully merging this pull request may close these issues.

tokenizer.encode function's param add_special_tokens=False does not work.
