Grammars compiled from JSON schemas accept invalid JSON input #286

@sfc-gh-azawlocki

I'm not sure if this is a bug or intended behavior.

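The JSON specification forbids unescaped control characters (U+0000 through U+001F) inside strings. For example, the following is not valid JSON when \t denotes a literal tab character (U+0009), as it does in the Python string literal in the snippet below: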

{"text": "\tab char is illegal here"}

A grammar compiled using GrammarCompiler.compile_builtin_json_grammar() correctly rejects this input, but a grammar compiled with GrammarCompiler.compile_json_schema() accepts it as valid JSON.
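As a cross-check that does not depend on xgrammar, the control-character rule can be reproduced with Python's standard json module, whose default strict mode rejects raw control characters inside strings. This is only a minimal sketch; the names RAW_TAB, ESCAPED_TAB, and is_valid_json are illustrative and not part of the original report.

```python
import json

# Python expands \t here, so this JSON text contains a raw tab (U+0009).
RAW_TAB = '{"text": "\tab char is illegal here"}'
# The doubled backslash keeps the two-character JSON escape sequence \t.
ESCAPED_TAB = '{"text": "\\tab char is illegal here"}'


def is_valid_json(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)  # strict=True by default
    except json.JSONDecodeError:
        return False
    return True


print("raw tab accepted:", is_valid_json(RAW_TAB))
print("escaped tab accepted:", is_valid_json(ESCAPED_TAB))
```

The raw-tab input is the one the builtin JSON grammar rejects; the escaped form is ordinary valid JSON and should be accepted by any conforming grammar.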

Standalone code snippet, tested with xgrammar-0.1.17:

import xgrammar
from transformers import AutoTokenizer

# Python turns \t into a literal tab (U+0009), so the JSON string holds a raw control character.
INVALID_INPUT = '{"text": "\tab char is illegal here"}'


def check_if_rejects_invalid_json(grammar: xgrammar.CompiledGrammar) -> None:
    matcher = xgrammar.GrammarMatcher(grammar, terminate_without_stop_token=True)
    assert not matcher._debug_accept_string(INVALID_INPUT, debug_print=True)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    tokenizer_info = xgrammar.TokenizerInfo.from_huggingface(tokenizer)

    grammar_compiler = xgrammar.GrammarCompiler(tokenizer_info)

    # The builtin JSON grammar correctly rejects the INVALID_INPUT
    builtin_json_grammar = grammar_compiler.compile_builtin_json_grammar()
    check_if_rejects_invalid_json(builtin_json_grammar)
    # prints: [11:13:09] /Users/runner/work/xgrammar/xgrammar/cpp/grammar_matcher_base.cc:301: Character 9 "\t" Rejected

    # A grammar compiled from a JSON schema accepts it
    json_schema_grammar = grammar_compiler.compile_json_schema(
        {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
            },
            "required": ["text"],
        }
    )
    check_if_rejects_invalid_json(json_schema_grammar)
    # raises AssertionError
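The snippet above only probes the tab character. To see whether other control characters are affected, a helper along the following lines could sweep the whole U+0000 to U+001F range. It is a sketch under assumptions: accepted_control_chars and strict_json_accepts are hypothetical names, and the acceptor shown uses the standard json module as a stand-in, since wiring in an xgrammar matcher is left to the reader.

```python
import json
from typing import Callable, List


def accepted_control_chars(accepts: Callable[[str], bool]) -> List[int]:
    """Return the code points in U+0000..U+001F that `accepts` lets through
    when they appear unescaped inside a JSON string value."""
    accepted = []
    for code_point in range(0x20):
        candidate = '{"text": "a' + chr(code_point) + 'b"}'
        if accepts(candidate):
            accepted.append(code_point)
    return accepted


def strict_json_accepts(text: str) -> bool:
    """Reference acceptor: the standard-library parser in strict mode."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A spec-conforming acceptor should let none of these code points through.
print(accepted_control_chars(strict_json_accepts))
```

To test a compiled grammar instead, one could pass a callable that wraps matcher._debug_accept_string from the snippet above, presumably creating a fresh GrammarMatcher for each candidate so that state from one input does not carry over to the next.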

Labels: bug
