Disaggregated Tokenization (Tokenization + Chat-Templating) #126

@vMaroon

Objectives

Today, tokenization is done by embedded HuggingFace (HF) Rust tokenizer bindings. Chat-templating is done by an embedded Python interpreter that calls the relevant HF libraries.

For production, we would also like to support a disaggregated preprocessing deployment: a service that handles prompt -> tokens for all APIs (the entire preprocessing step), ideally using code and features from vLLM.

Phased approach

  1. Service for tokenization and templating that runs HF (transformers) code
  2. A vLLM-based efficient and lightweight preprocessing service
    • This is the most scalable and future-proof approach, since preprocessing is becoming more complex over time and is implemented efficiently by the vLLM maintainers
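
To make the phasing concrete, here is a minimal sketch of how the service boundary could look so that the phase 1 (HF) backend can later be swapped for a phase 2 (vLLM) backend without changing callers. All names here (`PreprocessRequest`, `PreprocessResponse`, `PreprocessBackend`, `ToyBackend`, `preprocess`) are hypothetical and not part of this repository, HF, or vLLM. `ToyBackend` is a stand-in that only illustrates the control flow; it is not a real tokenizer or chat template.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class PreprocessRequest:
    """Hypothetical request shape: set exactly one of `prompt` or `messages`."""
    model: str
    prompt: Optional[str] = None                    # completions-style input
    messages: Optional[List[Dict[str, str]]] = None  # chat-style input


@dataclass(frozen=True)
class PreprocessResponse:
    token_ids: List[int]


class PreprocessBackend(Protocol):
    """Hypothetical seam between the service and its HF- or vLLM-based engine."""

    def render_chat(self, model: str, messages: List[Dict[str, str]]) -> str: ...

    def tokenize(self, model: str, text: str) -> List[int]: ...


class ToyBackend:
    """Illustration only: a fake template and UTF-8 bytes as 'token ids'."""

    def render_chat(self, model: str, messages: List[Dict[str, str]]) -> str:
        return "".join(f"<{m['role']}>{m['content']}\n" for m in messages)

    def tokenize(self, model: str, text: str) -> List[int]:
        return list(text.encode("utf-8"))


def preprocess(backend: PreprocessBackend, request: PreprocessRequest) -> PreprocessResponse:
    """Run the whole prompt -> tokens step for either API style."""
    if (request.prompt is None) == (request.messages is None):
        raise ValueError("set exactly one of `prompt` or `messages`")
    if request.messages is not None:
        # Chat requests are templated first, then tokenized like plain prompts.
        text = backend.render_chat(request.model, request.messages)
    else:
        text = request.prompt
    return PreprocessResponse(token_ids=backend.tokenize(request.model, text))
```

Under these assumptions, moving from phase 1 to phase 2 would mean providing a second class that satisfies `PreprocessBackend`, while the request and response shapes seen by callers stay the same.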
