Open
Labels: tokenization (related to tokenizers)
Description
I have a couple of suggestions for the tokenizer API -- things that I have needed to work around here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Tokenizer.swift
- add `eosToken`/`eosTokenId` to the `Tokenizer` protocol - this is needed to know when to stop producing tokens
  - `Tokenizer` already has `unknownToken` - I don't know if any of the other special tokens should be exposed, e.g. `bosToken`
- have a way to add to `TokenizerModel`/`knownTokenizers` or otherwise handle unknown tokenizers
  - right now it would probably be sufficient to map to `"PreTrainedTokenizer": BPETokenizer.self` - but in the future this might need to be more flexible
  - `TokenizerModel` is internal as are the various classes like `BPETokenizer` - in my workaround I mapped string -> string, e.g. `"Qwen2Tokenizer": "PreTrainedTokenizer"`, which is perhaps the right level -- not exposing too much of the implementation - anyway, some kind of API to allow registration of overrides like this or perhaps just "PreTrainedTokenizer" as a fallback for now
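
To make the two suggestions concrete, here is a rough sketch of the shape I have in mind. None of these names exist in the library today: `EOSAwareTokenizer`, `TokenizerOverrides`, `StubTokenizer` and `generate` are hypothetical, and the `known` set is only a stand-in for whatever `knownTokenizers` actually contains. The real API would presumably put `eosToken`/`eosTokenId` directly on `Tokenizer` rather than on a separate protocol.

```swift
// Hypothetical sketch only: illustrative names, not the existing tokenizer API.

/// The part of a tokenizer that a generation loop needs in order to stop.
protocol EOSAwareTokenizer {
    var eosToken: String? { get }
    var eosTokenId: Int? { get }
}

/// String -> string overrides for tokenizer class names, so that internal
/// types such as the concrete tokenizer classes never need to be exposed.
struct TokenizerOverrides {
    private var overrides: [String: String] = [:]
    /// Class name used when no explicit override is registered (assumption).
    var fallback: String? = "PreTrainedTokenizer"
    /// Stand-in for the names the library already knows how to construct.
    let known: Set<String>

    init(known: Set<String>) { self.known = known }

    mutating func register(_ name: String, as replacement: String) {
        overrides[name] = replacement
    }

    /// Known names pass through; otherwise use an override, then the fallback.
    func resolve(_ name: String) -> String? {
        if known.contains(name) { return name }
        return overrides[name] ?? fallback
    }
}

/// Toy generation loop: pulls token ids from `next` until EOS or `maxTokens`.
func generate(with tokenizer: EOSAwareTokenizer, maxTokens: Int, next: () -> Int) -> [Int] {
    var output: [Int] = []
    while output.count < maxTokens {
        let token = next()
        // A nil eosTokenId never matches, so the loop then runs to maxTokens.
        if token == tokenizer.eosTokenId { break }
        output.append(token)
    }
    return output
}

// Usage with made-up values.
struct StubTokenizer: EOSAwareTokenizer {
    let eosToken: String? = "</s>"
    let eosTokenId: Int? = 2
}

var overrides = TokenizerOverrides(known: ["PreTrainedTokenizer"])
overrides.register("Qwen2Tokenizer", as: "PreTrainedTokenizer")

var stream = [5, 7, 2, 9].makeIterator()
let tokens = generate(with: StubTokenizer(), maxTokens: 10) { stream.next() ?? 2 }
```

Keeping the override table as string -> string is deliberate: callers only name tokenizer classes as they appear in the config, and the mapping to concrete types stays internal.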
If these fit in with the vision for the tokenizer API, please consider them!
Thanks