Ambiguous tokens #4823

johnbester · 2025-04-23T10:54:41Z

I am writing an ANTLR3 grammar to interpret an MSSQL schema dump, and there are a few issues that I cannot seem to get around. For example, a table or constraint can have an "ON [...]" clause, so you would expect to be able to define an "ON" rule and use it in a "CREATE TABLE" rule. But there are also a number of boolean settings where "ON" and "OFF" are the values, so you could define rules such as 'ALTER' 'DATABASE' identifier 'SET' field = booleanValue or 'SET' identifier booleanValue. No matter how I change the order, the fragment tag or the case of a name, it always seems to work in one instance but not the other.

A similar problem is that 'PRIMARY' 'KEY' defines 'KEY' as a token, so when I configure a rule for 'WITH' '(' ('IGNORE_DUP_KEY'|...) '=' booleanValue, a 'KEY' token is identified, rendering the WITH rule useless. Similarly, an MSSQL schema dump sometimes has a 'WITH' clause that is not preceded by white space (e.g. '...)WITH (...)'). In this instance the 'WITH' token is not identified.

In my view, a far better way to implement a tokenizer would be to identify token sets instead of tokens. For example, SQL allows a table to be created as 'CREATE TABLE MYTABLE (...)', so you could define name as [A..Z]+ and create a rule 'CREATE' 'TABLE' name '(' ... ')'. But if there is a table called FOREIGN (CREATE TABLE FOREIGN ...), FOREIGN can be picked up as a token instead of a name, rendering the CREATE TABLE rule somewhat useless. If the token stream at every point allowed for either a constant token or some value, the parser could more easily apply the correct rule: after 'CONSTRAINT', you can expect 'FOREIGN' to be a token, but after 'TABLE', it is better seen as a name.
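To make the idea concrete, here is a minimal Python sketch of what such a tokenizer might look like. It is not how ANTLR works; `next_token` and its `expected_keywords` parameter are hypothetical names invented for illustration, and the assumption is that the parser passes in the set of keywords it would accept at the current position.

```python
import re

# A word is any run of letters, digits and underscores starting with a letter or underscore.
WORD = re.compile(r"[A-Z_][A-Z_0-9]*")

def next_token(text, pos, expected_keywords):
    """Return (kind, value, new_pos) for the token starting at or after pos.

    A word counts as a KEYWORD only if the caller says that keyword is
    acceptable here; otherwise the same word is returned as a NAME.
    """
    # Skip leading white space.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        return ("EOF", "", pos)
    m = WORD.match(text, pos)
    if m is None:
        # Anything that is not a word is treated as a one-character symbol.
        return ("SYMBOL", text[pos], pos + 1)
    word = m.group()
    kind = "KEYWORD" if word in expected_keywords else "NAME"
    return (kind, word, m.end())

ddl = "CREATE TABLE FOREIGN ("
first = next_token(ddl, 0, {"CREATE"})   # ("KEYWORD", "CREATE", 6)
second = next_token(ddl, 6, {"TABLE"})   # ("KEYWORD", "TABLE", 12)
third = next_token(ddl, 12, set())       # ("NAME", "FOREIGN", 20): only a name is expected here
```

With an empty expected set after 'TABLE', FOREIGN comes back as a name; the same word would come back as a keyword if the caller passed {"FOREIGN"}.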

I would appreciate any guidance on how to get around these issues.

kaby76 · 2025-04-23T11:58:02Z

Antlr4 does not have a "parser-state aware tokenizer." This is something we really could use for languages that are composed of several other languages, each defined in a separate spec that refers to the other specs, e.g., HTML/JavaScript/CSS, the backbone of the internet! We've implemented the Antlr4 grammars for each, but they can't be composed because tokenization is different for each grammar. Lexer rules follow two basic rules: "The rule with the longest string matches. If two or more lexer rules match a string, the first one wins." Antlr4's import does not help here because import pours all rules into a global, flat, ordered rule space. Import does work for things like the TrapC grammar, which is a memory-safe C based on the C grammar, with a few parser rules that are different but no differences at the lexer level. We had many discussions on this topic in grammars-v4. It can't be done elegantly in Antlr4. For your problem, the workaround is either a parser-rule wrapper or lexer modes. P.S.: This repo is for Antlr4, not Antlr3.
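As a rough illustration of the two lexer rules quoted above and of the parser-rule wrapper workaround, here is a small Python simulation. It is a sketch under stated assumptions, not Antlr4's implementation: the rule list, `tokenize`, `NAME_LIKE` and `parse_create_table_name` are all hypothetical names made up for this example.

```python
import re

# Ordered lexer rules (hypothetical): keywords first, the generic ID rule after them.
RULES = [
    ("CREATE", r"CREATE"), ("TABLE", r"TABLE"), ("FOREIGN", r"FOREIGN"),
    ("KEY", r"KEY"), ("WITH", r"WITH"), ("ON", r"ON"), ("OFF", r"OFF"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"), ("EQ", r"="),
    ("ID", r"[A-Z_][A-Z_0-9]*"),
    ("WS", r"\s+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps the win when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no lexer rule matches at position {pos}")
        if best_name != "WS":  # white space is skipped
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

# Parser-rule wrapper: token types that the parser also accepts as a name.
NAME_LIKE = {"ID", "FOREIGN", "KEY", "ON", "OFF"}

def parse_create_table_name(tokens):
    """Accept CREATE TABLE <identifier>, where identifier may be a keyword token."""
    kinds = [kind for kind, _ in tokens]
    if kinds[:2] != ["CREATE", "TABLE"] or len(tokens) < 3 or tokens[2][0] not in NAME_LIKE:
        raise SyntaxError("expected CREATE TABLE <identifier>")
    return tokens[2][1]
```

In this simulation, `tokenize("IGNORE_DUP_KEY = ON")` yields a single ID for IGNORE_DUP_KEY because the ID rule matches the longest string, `tokenize(")WITH (")` finds WITH without preceding white space, and FOREIGN is always lexed as a keyword. The wrapper then lets the parser treat that keyword token as a table name in `CREATE TABLE FOREIGN (`.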
