
Incompatibility with the vLLM Mistral Tokenizer #141

@gcalmettes

Description


The current version of lm-format-enforcer is not compatible with the Mistral tokenizer.

There is no known workaround, and vLLM disabled guided_json when the Mistral tokenizer is used (see this PR).

This prevents models like Pixtral, which require the MistralTokenizer, from being used with lm-format-enforcer.

Minimal example to reproduce the issue:

import vllm
from lmformatenforcer.integrations.vllm import build_vllm_token_enforcer_tokenizer_data

model_id = "mistral-community/pixtral-12b-240910"
llm = vllm.LLM(model=model_id, tokenizer_mode="mistral")
tokenizer_data = build_vllm_token_enforcer_tokenizer_data(llm)

Logs:

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python3.10/dist-packages/lmformatenforcer/integrations/vllm.py", line 40, in build_vllm_token_enforcer_tokenizer_data
     return build_token_enforcer_tokenizer_data(tokenizer)
   File "/usr/local/lib/python3.10/dist-packages/lmformatenforcer/integrations/transformers.py", line 77, in build_token_enforcer_tokenizer_data
     regular_tokens = _build_regular_tokens_list(tokenizer)
   File "/usr/local/lib/python3.10/dist-packages/lmformatenforcer/integrations/transformers.py", line 57, in _build_regular_tokens_list
     token_0 = tokenizer.encode("0")[-1]
 TypeError: Tekkenizer.encode() missing 2 required positional arguments: 'bos' and 'eos'

The encode method of the MistralTokenizer (Tekkenizer) requires two additional positional arguments, bos and eos (both booleans), which lm-format-enforcer does not pass.
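One possible direction, sketched below with hypothetical names (this is not code from lm-format-enforcer or mistral_common): a thin adapter could wrap a Mistral-style tokenizer so that its encode(text, bos, eos) signature matches the encode(text) call lm-format-enforcer makes. A stub stands in for the real Tekkenizer so the sketch is self-contained.

```python
# Hypothetical adapter sketch: bridge a Mistral-style encode(text, bos, eos)
# signature to the HuggingFace-style encode(text) that
# lm-format-enforcer's _build_regular_tokens_list expects.

class EncodeSignatureAdapter:
    """Forward encode(text) to the wrapped tokenizer with bos/eos disabled."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    def encode(self, text):
        # Pass the two required boolean flags positionally; both False so
        # no special tokens are prepended or appended.
        return self._tokenizer.encode(text, False, False)

    def __getattr__(self, name):
        # Delegate every other attribute to the wrapped tokenizer.
        return getattr(self._tokenizer, name)


# Stub standing in for mistral_common's Tekkenizer, only to illustrate
# the signature mismatch; the real class tokenizes very differently.
class StubTekkenizer:
    def encode(self, text, bos, eos):
        ids = [ord(c) for c in text]
        if bos:
            ids = [1] + ids
        if eos:
            ids = ids + [2]
        return ids


adapted = EncodeSignatureAdapter(StubTekkenizer())
# The call that currently raises TypeError now succeeds:
token_0 = adapted.encode("0")[-1]
print(token_0)
```

Whether an adapter like this is the right fix (versus lm-format-enforcer detecting the Mistral tokenizer and calling encode with the flags directly) is an open design question for the maintainers.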
