Open
Description
Objectives
Today, tokenization is done by embedded HuggingFace (HF) Rust tokenizer bindings. Chat templating is done by an embedded Python interpreter that calls the relevant HF libraries.
For production, we would also like to support a disaggregated preprocessing deployment: a service that handles prompt -> tokens for all APIs (the entire preprocessing step), ideally using code and features from vLLM.
Phased approach
- Service for tokenization and templating that runs HF (transformers) code
- A vLLM-based efficient and lightweight preprocessing service
  - This is the most scalable and future-proof approach, since preprocessing is getting more complex over time and is implemented efficiently by the vLLM maintainers
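
To make the phases concrete, here is a minimal sketch of what a backend-agnostic prompt -> tokens contract might look like, so that an HF-backed service could later be swapped for a vLLM-backed one without changing callers. Everything in it is an assumption for illustration: `PreprocessRequest`, `Backend`, `ToyBackend`, and `preprocess` are hypothetical names, not part of this project, HF, or vLLM, and `ToyBackend` is a whitespace stand-in rather than a real tokenizer or chat template.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class PreprocessRequest:
    """Hypothetical request shape covering completion-style and chat-style APIs."""
    model: str
    prompt: Optional[str] = None                     # completion-style input
    messages: Optional[list[dict[str, str]]] = None  # chat-style input


class Backend(Protocol):
    """What an HF-based or vLLM-based backend would have to provide (assumed)."""

    def apply_chat_template(self, model: str, messages: list[dict[str, str]]) -> str: ...

    def tokenize(self, model: str, text: str) -> list[int]: ...


class ToyBackend:
    """Stand-in backend so the sketch runs without model files. NOT a real tokenizer."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}

    def apply_chat_template(self, model: str, messages: list[dict[str, str]]) -> str:
        # Render each message as "<role>content" on its own line.
        return "".join(f"<{m['role']}>{m['content']}\n" for m in messages)

    def tokenize(self, model: str, text: str) -> list[int]:
        # Assign ids to whitespace-separated words in first-seen order.
        return [self._vocab.setdefault(w, len(self._vocab)) for w in text.split()]


def preprocess(backend: Backend, req: PreprocessRequest) -> list[int]:
    """Run the whole preprocessing step: optional templating, then tokenization."""
    if (req.prompt is None) == (req.messages is None):
        raise ValueError("exactly one of prompt or messages must be set")
    if req.prompt is not None:
        text = req.prompt
    else:
        text = backend.apply_chat_template(req.model, req.messages)
    return backend.tokenize(req.model, text)
```

A caller would build a `PreprocessRequest` with either `prompt` or `messages` and pass it to `preprocess` together with whichever backend the deployment uses. Keeping templating and tokenization behind one entry point is one way to let the service own the entire preprocessing step.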