Closed
Labels
bug: Something isn't working
Description
I'm not sure if this is a bug or intended behavior.
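The JSON specification forbids unescaped control characters (U+0000 through U+001F) inside strings. For example, the following is not valid JSON when \t stands for a raw tab character (U+0009) rather than the two-character escape sequence: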
{"text": "\tab char is illegal here"}
A grammar compiled using GrammarCompiler.compile_builtin_json_grammar() correctly rejects this input, but a grammar compiled with GrammarCompiler.compile_json_schema() accepts it as valid JSON.
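For reference, the difference between a raw tab and the escaped form can be checked with Python's standard json module, which rejects raw control characters in its default strict mode. This is only a small sketch of the expected behavior, independent of xgrammar; the helper name is_valid_json is made up for illustration.

```python
import json

# "\t" is a Python escape, so this string contains a raw tab (U+0009).
RAW_TAB = '{"text": "\tab char is illegal here"}'
# "\\t" is a backslash followed by "t": the JSON escape sequence, which is valid.
ESCAPED_TAB = '{"text": "\\tab char is illegal here"}'


def is_valid_json(text: str) -> bool:
    """Hypothetical helper: True if json.loads accepts `text` in strict mode."""
    try:
        json.loads(text)  # strict=True is the default
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_json(RAW_TAB))      # False
    print(is_valid_json(ESCAPED_TAB))  # True
```

A grammar that follows the JSON specification would be expected to agree with this: reject the first string and accept the second.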
Standalone code snippet, tested with xgrammar-0.1.17:
import xgrammar
from transformers import AutoTokenizer

# "\t" is a Python escape here, so the string contains a raw tab character (U+0009).
INVALID_INPUT = '{"text": "\tab char is illegal here"}'


def check_if_rejects_invalid_json(grammar: xgrammar.CompiledGrammar) -> None:
    """Assert that a matcher built from `grammar` rejects INVALID_INPUT."""
    matcher = xgrammar.GrammarMatcher(grammar, terminate_without_stop_token=True)
    assert not matcher._debug_accept_string(INVALID_INPUT, debug_print=True)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    tokenizer_info = xgrammar.TokenizerInfo.from_huggingface(tokenizer)
    grammar_compiler = xgrammar.GrammarCompiler(tokenizer_info)

    # The builtin JSON grammar correctly rejects the INVALID_INPUT
    builtin_json_grammar = grammar_compiler.compile_builtin_json_grammar()
    check_if_rejects_invalid_json(builtin_json_grammar)
    # prints: [11:13:09] /Users/runner/work/xgrammar/xgrammar/cpp/grammar_matcher_base.cc:301: Character 9 "\t" Rejected

    # A grammar compiled from a JSON schema accepts it
    json_schema_grammar = grammar_compiler.compile_json_schema(
        {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
            },
            "required": ["text"],
        }
    )
    check_if_rejects_invalid_json(json_schema_grammar)
    # raises AssertionError