Ambiguous tokens #4823
Replies: 1 comment
Antlr4 does not have a "parser-state aware tokenizer." This is something we really could use for languages that are broken into a number of other languages, each defined in a separate spec that refers to the other specs, e.g., HTML/JavaScript/CSS--the backbone of the internet!! We've implemented the Antlr4 grammars for each, but they can't be composed because tokenization is different for each grammar. The lexer rules follow two basic rules: "The rule with the longest string matches. If two or more lexer rules match a string, the first one wins." Antlr4's
I am writing an antlr3 grammar to interpret an MSSQL schema dump. There are a few issues that I simply cannot seem to get around. For example, a table or constraint can have an "ON [...]" clause, so you would expect to be able to define an "ON" rule and use it in a "CREATE TABLE" rule. But there are also a number of boolean settings where "ON" and "OFF" are the boolean values, so you could define rules such as 'ALTER' 'DATABASE' identifier 'SET' field = booleanValue or 'SET' identifier booleanValue. No matter how I change the rule order, the fragment tag, or the case of a name, it always seems to work in one instance but not the other. A similar problem is that 'PRIMARY' 'KEY' identifies 'KEY' as a token, and when I configured rules for 'WITH' '(' ('IGNORE_DUP_KEY'|...) '=' booleanValue, a 'KEY' token is identified, rendering the WITH rule useless. Similarly, an MSSQL schema dump sometimes has a 'WITH' clause that is not preceded by white space (e.g. '...)WITH (...)'). In this instance the 'WITH' token is not identified.
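To make these symptoms concrete, here is a minimal Python sketch of the longest-match, first-rule-wins policy quoted in the reply. It is not ANTLR's generated code, and an antlr3 lexer may well behave differently; the rule names and patterns below are assumptions chosen purely for illustration.

```python
import re

# Hypothetical token rules, listed in priority order (earlier wins on a tie).
RULES = [
    ("WS", r"\s+"),
    ("KEY", r"KEY"),
    ("WITH", r"WITH"),
    ("ON", r"ON"),
    ("OFF", r"OFF"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PUNCT", r"[()=,]"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (token_type, lexeme) pairs, skipping whitespace."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # A strictly longer match replaces the current best;
            # on equal length the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "WS":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


tokens = tokenize(")WITH (IGNORE_DUP_KEY = ON)")
```

Under this policy IGNORE_DUP_KEY comes out as a single IDENT rather than containing a KEY token, and ')WITH' splits cleanly without white space. ON, however, is always the ON token wherever it appears, so both the "ON [...]" clause and booleanValue would have to accept that same token type in the parser.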
In my view, a far better way to implement a tokenizer would be to identify token sets instead of tokens. For example, SQL allows a table to be created as 'CREATE TABLE MYTABLE (...)', so you could define name as [A..Z]+ and create a rule 'CREATE' 'TABLE' name '(' ... ')'. But if there is a table called FOREIGN (CREATE TABLE FOREIGN ...), FOREIGN can be picked up as a token instead of a name, rendering the CREATE TABLE rule somewhat useless. If the token stream at every point allowed for either a constant token or some value, the parser could more easily apply the correct rule: after 'CONSTRAINT', you can expect 'FOREIGN' to be a token, but after 'TABLE', it may be better seen as a name.
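The idea in this paragraph can be sketched in a few lines of Python. This is only an illustration of context-dependent classification, not something antlr3 provides: the EXPECTED table and the classify function are hypothetical, and a real parser-state-aware tokenizer would derive the set of acceptable keywords from the parser state rather than from the previous token alone.

```python
# Hypothetical table: which keywords are acceptable after a given
# previous token. None stands for the start of a statement.
EXPECTED = {
    None: {"CREATE", "CONSTRAINT"},
    "CREATE": {"TABLE"},
    "TABLE": set(),            # only a name may follow TABLE
    "CONSTRAINT": {"FOREIGN"},
    "FOREIGN": {"KEY"},
}


def classify(words):
    """Label each word as a keyword or a NAME, depending on context."""
    out = []
    prev = None
    for word in words:
        upper = word.upper()
        if upper in EXPECTED.get(prev, set()):
            kind = upper       # the keyword is expected here
        else:
            kind = "NAME"      # anything else is treated as a name
        out.append((kind, word))
        prev = kind
    return out


as_name = classify("CREATE TABLE FOREIGN".split())
as_keyword = classify("CONSTRAINT FOREIGN KEY".split())
```

Here the same text FOREIGN is labelled NAME after TABLE and as the keyword FOREIGN after CONSTRAINT, which is the behaviour described above.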
I would appreciate any guidance on how to get around these issues.