Hi, I found that the tokenizer behavior differs from Python transformers when I use a Phi-3 model.
swift-transformers
func testTokenizer() async throws {
let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
let inputIds = tokenizer(" Hi")
print(inputIds)
// output: [1, 6324]
}
Python transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
input_ids = tokenizer.encode(" Hi")
print(input_ids)
# output: [1, 29871, 6324]
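
Decoding those ids makes the difference concrete. A quick check (the token strings assume the Llama-style SentencePiece vocabulary that Phi-3 uses, where 29871 is the bare underline):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
# 29871 decodes to the bare SentencePiece underline "▁";
# 6324 is "▁Hi", i.e. "Hi" with a leading underline.
print(tokenizer.convert_ids_to_tokens([1, 29871, 6324]))
# expected: ['<s>', '▁', '▁Hi']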
Python transformers prepends 29871 (▁) before 6324. This seems to be done by the normalizer. I debugged the issue and found that the normalizer is ignored when legacy is false, at:
swift-transformers/Sources/Tokenizers/Tokenizer.swift
Lines 341 to 344 in fc65432
if !isLegacy {
    configDictionary.removeValue(forKey: "normalizer")
    configDictionary["pre_tokenizer"] = ["type": "Metaspace", "replacement": sentencePieceUnderline, "add_prefix_space": true, "prepend_scheme": "first"]
}
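
If the normalizer is responsible, running it directly should show where the extra underline comes from. A sketch with the tokenizers library (this assumes the checkpoint ships a tokenizer.json whose normalizer prepends "▁" and then replaces spaces with "▁", as Llama-style legacy=false tokenizers do):

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
# Prepend "▁", then replace the existing space with "▁":
# " Hi" becomes "▁▁Hi", which the model then splits into
# "▁" (29871) and "▁Hi" (6324).
print(tok.normalizer.normalize_str(" Hi"))
# expected: "▁▁Hi"

Removing the normalizer, as the Swift code above does, leaves only the Metaspace pre-tokenizer; judging from the Swift output, that maps " Hi" to just "▁Hi" (6324), so the lone "▁" (29871) is never emitted.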