[Misc] Refactor tokenizer interface #29693

DarkLight1337 · 2025-11-28T17:25:46Z

Purpose

Consolidate AnyTokenizer and TokenizerBase by making TokenizerBase a Protocol, and rename both to TokenizerLike (with backward compatibility)
Remove unused attributes from TokenizerBase: sep_token, pad_token, encode_one
Move MistralTokenizer, TokenizerLike, TokenizerRegistry into vllm.tokenizers
Fix various type checking issues associated with replacing AnyTokenizer with TokenizerLike
Move tests from tests/tokenization to tests/tokenizers and update Buildkite pipeline accordingly
Consolidate related tests to fit the directory structure
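To illustrate the first item, here is a minimal sketch of a Protocol-based tokenizer interface. It is not the actual vLLM definition: the two methods shown, the `WhitespaceTokenizer` class, the `count_tokens` helper, and the plain-assignment aliases are all assumptions made for illustration. The real interface has more members and may implement backward compatibility differently.

```python
from typing import Protocol


class TokenizerLike(Protocol):
    """Hypothetical, trimmed-down structural interface for a tokenizer."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, ids: list[int]) -> str: ...


class WhitespaceTokenizer:
    """Toy tokenizer; note that it does not inherit from TokenizerLike."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._words: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            # Assign the next free id to any word not seen before.
            if word not in self._vocab:
                self._vocab[word] = len(self._words)
                self._words.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self._words[i] for i in ids)


def count_tokens(tokenizer: TokenizerLike, text: str) -> int:
    """Accepts any object that structurally matches TokenizerLike."""
    return len(tokenizer.encode(text))


# Assumed form of the backward-compatible names: plain aliases, so that
# annotations written against the old names still resolve.
AnyTokenizer = TokenizerLike
TokenizerBase = TokenizerLike
```

Because a Protocol is checked structurally, a static type checker accepts `count_tokens(WhitespaceTokenizer(), "a b a")` even though `WhitespaceTokenizer` never subclasses `TokenizerLike`. This is what allows a union-style alias and an abstract base class to collapse into a single type.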

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: DarkLight1337 <[email protected]>

mergify · 2025-11-28T17:26:27Z

Documentation preview: https://vllm--29693.org.readthedocs.build/en/29693/

chatgpt-codex-connector · 2025-11-29T02:34:15Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

DarkLight1337 · 2025-11-29T05:02:15Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring that modernizes the tokenizer abstraction by introducing the TokenizerLike protocol. This change improves modularity, type safety, and flexibility across the codebase. The consolidation of tokenizer tests into a dedicated tests/tokenizers directory is also a great organizational improvement.

I've identified a couple of critical bug fixes related to handling cases where the tokenizer is not initialized (skip_tokenizer_init=True), which significantly improves the robustness of the OpenAI-compatible endpoints. These changes prevent potential crashes and provide clearer error messages to the user.

Overall, this is an excellent contribution that enhances the maintainability and reliability of vLLM's tokenization system. The changes are thorough and consistent.
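The handling of an uninitialized tokenizer mentioned in this review can be sketched as follows. This is an assumption-laden illustration, not the code in serving_completion.py or serving_engine.py: `require_tokenizer` and `tokenize_prompt` are hypothetical helpers, and the error type and message are invented for the example.

```python
from typing import Optional, Protocol


class TokenizerLike(Protocol):
    """Hypothetical minimal interface; the real one has more members."""

    def encode(self, text: str) -> list[int]: ...


def require_tokenizer(tokenizer: Optional[TokenizerLike]) -> TokenizerLike:
    """Return the tokenizer, or raise a descriptive error if it is missing."""
    if tokenizer is None:
        # Fail early with a clear message instead of an AttributeError
        # on None somewhere deeper in the request path.
        raise ValueError(
            "This request needs a tokenizer, but none was initialized "
            "(skip_tokenizer_init=True)."
        )
    return tokenizer


def tokenize_prompt(tokenizer: Optional[TokenizerLike], prompt: str) -> list[int]:
    """Hypothetical request-path helper that tokenizes a text prompt."""
    return require_tokenizer(tokenizer).encode(prompt)
```

Narrowing `Optional[TokenizerLike]` to `TokenizerLike` in one place also keeps the type checker satisfied at every call site, which fits the type-checking fixes listed in the PR description.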

vllm/entrypoints/openai/serving_completion.py

vllm/entrypoints/openai/serving_engine.py

DarkLight1337 · 2025-11-29T12:02:16Z

The failing tests are known failures on main.

DarkLight1337 added 2 commits November 28, 2025 17:17

[Misc] Convert TokenizerBase to protocol, consolidate tokenizer tests

bd9e0b3

Signed-off-by: DarkLight1337 <[email protected]>

BC

668eb2c

Signed-off-by: DarkLight1337 <[email protected]>

github-project-automation bot added this to Structured Output Nov 28, 2025

mergify bot added the tool-calling label Nov 28, 2025

github-project-automation bot added this to Tool Calling Nov 28, 2025

DarkLight1337 added 14 commits November 28, 2025 17:27

Unnecessary quote

6684cd0

Signed-off-by: DarkLight1337 <[email protected]>

Rename

9f3ab67

Signed-off-by: DarkLight1337 <[email protected]>

Forward merge

d35e431

Signed-off-by: DarkLight1337 <[email protected]>

Merge branch 'main' into tokenizer-proto

38afdfd

Signed-off-by: DarkLight1337 <[email protected]>

Oops

a616c7a

Signed-off-by: DarkLight1337 <[email protected]>

Docstring

94688c1

Signed-off-by: DarkLight1337 <[email protected]>

Increase tolerance

54787a5

Signed-off-by: DarkLight1337 <[email protected]>

[Bugfix] Fix wrong mock attribute

bb46b1a

Signed-off-by: DarkLight1337 <[email protected]>

Merge branch 'main' into tokenizer-proto

4ab6dcb

Signed-off-by: DarkLight1337 <[email protected]>

Avoid circular import

c8e948d

Signed-off-by: DarkLight1337 <[email protected]>

Merge branch 'fix-serving-test' into tokenizer-proto

665305e

Fix mypy

94b9c62

Signed-off-by: DarkLight1337 <[email protected]>

Fix circular import

3dbb92c

Signed-off-by: DarkLight1337 <[email protected]>

rel import

9aeed95

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 requested review from benchislett, chaunceyjiang, mgoin, russellb and tjtanaa as code owners November 29, 2025 02:34

DarkLight1337 added 2 commits November 29, 2025 04:19

Merge branch 'main' into tokenizer-proto

5137574

Move

3866bae

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 changed the title from "[Misc] Convert TokenizerBase to protocol, consolidate tokenizer tests" to "[Misc] Refactor tokenizer interface" Nov 29, 2025

Unnecessary runtime_checkable

c59476a

Signed-off-by: DarkLight1337 <[email protected]>

gemini-code-assist bot reviewed Nov 29, 2025

vllm/entrypoints/openai/serving_completion.py (resolved)

vllm/entrypoints/openai/serving_engine.py (resolved)

Update import

9d974a3

Signed-off-by: DarkLight1337 <[email protected]>

Isotr0py approved these changes Nov 29, 2025

DarkLight1337 added 5 commits November 29, 2025 06:08

Avoid conflict with tokenizers package

15bf2a0

Signed-off-by: DarkLight1337 <[email protected]>

Don't run type validation on internal structures

6b558be

Signed-off-by: DarkLight1337 <[email protected]>

kw only

767f2c8

Signed-off-by: DarkLight1337 <[email protected]>

Fix mypy

02c1857

Signed-off-by: DarkLight1337 <[email protected]>

Fix test

a5fbb67

Signed-off-by: DarkLight1337 <[email protected]>

vllm-bot merged commit 34a9842 into vllm-project:main Nov 29, 2025
129 of 134 checks passed

github-project-automation bot moved this to Done in Tool Calling Nov 29, 2025

github-project-automation bot moved this to Done in Structured Output Nov 29, 2025

DarkLight1337 deleted the tokenizer-proto branch November 29, 2025 12:02

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

[Misc] Refactor tokenizer interface (vllm-project#29693)

97546bc

Signed-off-by: DarkLight1337 <[email protected]>

amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025

[Misc] Refactor tokenizer interface (vllm-project#29693)

d8d85bb

Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Hashem Hashemi <[email protected]>

Rohan138 mentioned this pull request Dec 4, 2025

[Bugfix]: Fix TokenizerLike interface #30009

Merged

5 tasks

charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025

[Misc] Refactor tokenizer interface (vllm-project#29693)

5031bcc

Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>

Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025

[Misc] Refactor tokenizer interface (vllm-project#29693)

87e3bec

Signed-off-by: DarkLight1337 <[email protected]>

charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 9, 2025

[Misc] Refactor tokenizer interface (vllm-project#29693)

68d0957

Signed-off-by: DarkLight1337 <[email protected]>
