@DarkLight1337 (Member) commented Nov 28, 2025

Purpose

  • Consolidate AnyTokenizer and TokenizerBase by making TokenizerBase a Protocol, and rename both of them to TokenizerLike (with back-compatibility)
  • Remove unused attributes from TokenizerBase: sep_token, pad_token, encode_one
  • Move MistralTokenizer, TokenizerLike, TokenizerRegistry into vllm.tokenizers
  • Fix various type checking issues associated with replacing AnyTokenizer with TokenizerLike
  • Move tests from tests/tokenization to tests/tokenizers and update Buildkite pipeline accordingly
  • Consolidate related tests to fit the directory structure
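The first bullet (turning a base class into a Protocol and keeping the old names as aliases) can be illustrated with a minimal sketch. This is not the actual vLLM definition: the two-method surface (`encode`, `decode`), the `WhitespaceTokenizer` class, and the `count_tokens` helper are hypothetical and chosen only to show the pattern.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TokenizerLike(Protocol):
    """Structural interface: any object with these methods qualifies.

    The method set here is an assumption made for illustration.
    """

    def encode(self, text: str) -> list[int]: ...

    def decode(self, ids: list[int]) -> str: ...


# Back-compatibility aliases so code using the old names keeps working
# (illustrative; the real aliases may be defined differently).
AnyTokenizer = TokenizerLike
TokenizerBase = TokenizerLike


class WhitespaceTokenizer:
    """Hypothetical toy tokenizer. It does NOT inherit from TokenizerLike."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._inverse: dict[int, str] = {}

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self._vocab:
                new_id = len(self._vocab)
                self._vocab[word] = new_id
                self._inverse[new_id] = word
            ids.append(self._vocab[word])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self._inverse[i] for i in ids)


def count_tokens(tokenizer: TokenizerLike, text: str) -> int:
    """Accepts anything that structurally matches TokenizerLike."""
    return len(tokenizer.encode(text))
```

Because a Protocol is matched structurally, third-party tokenizers can satisfy the interface without subclassing anything, which is presumably what lets a union type such as AnyTokenizer and a base class such as TokenizerBase collapse into one name.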

Test Plan

Test Result



mergify bot commented Nov 28, 2025

Documentation preview: https://vllm--29693.org.readthedocs.build/en/29693/

@mergify mergify bot added the following labels on Nov 28, 2025: documentation, ci/build, deepseek, frontend, llama, multi-modality, performance, qwen, structured-output, v1
@mergify mergify bot added the tool-calling label Nov 28, 2025
Signed-off-by: DarkLight1337 <[email protected]>

@DarkLight1337 (Member, Author) commented:

/gemini review

@DarkLight1337 changed the title from "[Misc] Convert TokenizerBase to protocol, consolidate tokenizer tests" to "[Misc] Refactor tokenizer interface" on Nov 29, 2025
Signed-off-by: DarkLight1337 <[email protected]>
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request is a significant and well-executed refactoring that modernizes the tokenizer abstraction by introducing the TokenizerLike protocol. This change improves modularity, type safety, and flexibility across the codebase. The consolidation of tokenizer tests into a dedicated tests/tokenizers directory is also a great organizational improvement.

I've identified a couple of critical bug fixes related to handling cases where the tokenizer is not initialized (skip_tokenizer_init=True), which significantly improves the robustness of the OpenAI-compatible endpoints. These changes prevent potential crashes and provide clearer error messages to the user.
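The kind of guard described here can be sketched as follows. This is a minimal illustration under assumptions, not the code from this PR: `require_tokenizer`, `tokenize_prompt`, and `TokenizerNotInitializedError` are hypothetical names, and the one-method `TokenizerLike` stands in for the real protocol.

```python
from typing import Optional, Protocol


class TokenizerLike(Protocol):
    """Stand-in protocol; the single method is an assumption."""

    def encode(self, text: str) -> list[int]: ...


class TokenizerNotInitializedError(ValueError):
    """Hypothetical error type for requests that need a tokenizer."""


def require_tokenizer(tokenizer: Optional[TokenizerLike]) -> TokenizerLike:
    """Return the tokenizer, or fail early with a readable message.

    Without this check, a None tokenizer would surface later as an
    AttributeError somewhere deep in request handling.
    """
    if tokenizer is None:
        raise TokenizerNotInitializedError(
            "This request requires a tokenizer, but none was initialized "
            "(skip_tokenizer_init=True)."
        )
    return tokenizer


def tokenize_prompt(tokenizer: Optional[TokenizerLike], prompt: str) -> list[int]:
    # Narrow Optional[TokenizerLike] to TokenizerLike before use.
    return require_tokenizer(tokenizer).encode(prompt)
```

Centralizing the check in one helper also narrows the `Optional` type for static checkers, so call sites do not each need their own `is None` branch.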

Overall, this is an excellent contribution that enhances the maintainability and reliability of vLLM's tokenization system. The changes are thorough and consistent.

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member, Author) commented:

Failing tests are known failures on main

@vllm-bot vllm-bot merged commit 34a9842 into vllm-project:main Nov 29, 2025
129 of 134 checks passed
@DarkLight1337 DarkLight1337 deleted the tokenizer-proto branch November 29, 2025 12:02
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 9, 2025

Labels

ci/build, deepseek, documentation, frontend, llama, multi-modality, performance, qwen, ready, ready-run-all-tests, structured-output, tool-calling, v1

Projects

Status: Done


3 participants