Replies: 2 comments
-
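What you are describing is a parse tree with interleaved sub-language content (comments). The off-channel content (the comment) is parsed in a separate language. You can do this with ANTLR, but it needs to be done in several phases:
1. Create a grammar for the "main" language. Parse the input using the "main" language (I guess csharp <https://github.com/antlr/grammars-v4/tree/master/csharp>). The result is a parse tree and a token stream with (off-channel) comments.
2. Create a grammar for the "sub" language. Parse each off-channel comment token and compute a tree and a sub token stream.
3. Write a tree walker that constructs a new tree that stitches the two types of parse trees together.
This problem is pretty common. An example is the postgresql grammar, where postgresql and meta commands are mixed in the input. The meta commands are not part of postgresql. Originally, someone tried to combine the grammars of the two languages but ended up with a huge mess. The two languages have different lexing requirements, and the combined grammar contained many ambiguities not present in a straight postgresql grammar.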
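To make the later phases concrete, here is a rough Python sketch (not ANTLR, and not taken from any grammar in grammars-v4). It assumes phase 1 has already produced a flat list of (kind, text) segments; `parse_comment` and `stitch` are hypothetical names, and the "sub" language here is only the opener/body/closer split from the question, standing in for a real second grammar.

```python
from typing import Dict, List, Tuple, Union

Segment = Tuple[str, str]  # (kind, text); kind is "content" or "comment"
Node = Dict[str, Union[str, Dict[str, str]]]


def parse_comment(text: str, opener: str = "/*", closer: str = "*/") -> Dict[str, str]:
    """Phase 2 stand-in: 'parse' one comment token into opener/body/closer."""
    well_formed = (
        len(text) >= len(opener) + len(closer)
        and text.startswith(opener)
        and text.endswith(closer)
    )
    if not well_formed:
        raise ValueError(f"not a complete comment: {text!r}")
    body = text[len(opener):len(text) - len(closer)]
    return {"opener": opener, "body": body, "closer": closer}


def stitch(segments: List[Segment]) -> List[Node]:
    """Phase 3: walk the main-language segments and splice in comment subtrees."""
    tree: List[Node] = []
    for kind, text in segments:
        if kind == "comment":
            tree.append({"comment": parse_comment(text)})
        else:
            tree.append({"content": text})
    return tree


# Assumed phase 1 output for a small Java-like input.
phase1 = [
    ("content", "public class Example {\n"),
    ("comment", "/*Nothing yet*/"),
    ("content", "\n}"),
]
tree = stitch(phase1)
```

With a real setup, `parse_comment` would be replaced by a call into the parser generated from the "sub" grammar, and `stitch` by a walker over the main parse tree.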
-
You can likely do this with a lexer with modes. Capture strings in normal
mode and trigger a comment mode when your comment introducers are found. In
comment mode, return different tokens than in main mode until your comment
closes (remember that comments can be nested, so you may need to recurse
into the comment mode). Something like:
STRING: '"' … ; // capture strings
COMMENT_START: '/*' -> pushMode(commentMode) ;
C: . ;
mode commentMode;
COMMAND: 'xyz' ;
OTHER: 'other' ;
…
COMMENT_END: '*/' -> popMode ;
CHAR: . -> type(C) ;
Then your parser is trivial. There is no need to write a grammar for the
commented language.
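For checking what such a lexer should hand back, the same mode-switching idea can be sketched in plain Python. This is only an illustration, not ANTLR output: `split_segments` is a hypothetical helper, it assumes double-quoted strings with backslash escapes, and it does not handle nested comments.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (kind, text); kind is "content" or "comment"


def split_segments(source: str, opener: str = "/*", closer: str = "*/") -> List[Segment]:
    """Split source into content/comment segments, ignoring openers inside strings."""
    segments: List[Segment] = []
    buf: List[str] = []
    mode = "main"  # "main" or "comment", mirroring the two lexer modes
    i = 0
    while i < len(source):
        if mode == "main":
            if source[i] == '"':
                # Copy the whole string literal so a "/*" inside it is not a comment.
                j = i + 1
                while j < len(source) and source[j] != '"':
                    j += 2 if source[j] == "\\" else 1
                j = min(j + 1, len(source))
                buf.append(source[i:j])
                i = j
            elif source.startswith(opener, i):
                if buf:
                    segments.append(("content", "".join(buf)))
                buf = [opener]
                mode = "comment"  # like pushMode(commentMode)
                i += len(opener)
            else:
                buf.append(source[i])
                i += 1
        else:
            if source.startswith(closer, i):
                buf.append(closer)
                segments.append(("comment", "".join(buf)))
                buf = []
                mode = "main"  # like popMode
                i += len(closer)
            else:
                buf.append(source[i])
                i += 1
    if buf:
        # Trailing content, or an unterminated comment kept as-is.
        segments.append(("content" if mode == "main" else "comment", "".join(buf)))
    return segments


src = 'a = "/* not a comment */"; /* xyz */ b'
parts = split_segments(src)
```

Joining the segment texts back together reproduces the input, which is the property a preserving preprocessor needs.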
On Thu, May 8, 2025 at 05:14 Ken Domino wrote:
The parser should split it in the following structure:
file:
  content: "public class Example {\n public static void main(String[] args) {\n "
  comment:
    opener: "/*"
    body: "Nothing yet"
    closer: "*/"
  content: "\n }\n}"
What you are describing is a parse tree with interleaved sub-language
content (comments). The off-channel content (comment) is parsed in a
separate language. You can do this with ANTLR, but it will need to be done
in several phases.
1. Create a grammar for "main" language. Parse input using "main"
language (I guess csharp
<https://github.com/antlr/grammars-v4/tree/master/csharp>). The result
is a parse tree and token stream with (off-channel) comments.
2. Create a grammar for the "sub" language. Parse each off-channel comment
token and compute a tree and a sub token stream.
3. Write a tree walker that constructs a new tree that stitches the
two types of parse trees together.
This problem is pretty common. An example is the postgresql grammar
<https://github.com/antlr/grammars-v4/commits/master/sql/postgresql>,
where we have postgresql and meta commands mixed in input. The meta
commands are not part of postgresql.
Originally, someone tried to combine the grammars of the two languages
together but ended up with a huge mess. The two languages have different
lexing requirements. And, the combined grammar contained many ambiguities
not present in a straight postgresql grammar.
-
My goal is to process comment contents of a file, while preserving the rest. For example in a given file:
The parser should split it in the following structure:
The ANTLR documentation goes into escaping comments; however, in this case they must be preserved, and the concept of "match everything, but ..." does not seem straightforward (plus, of course, dealing with fake comments inside strings is a hassle).
How would I go about writing a grammar that accomplishes this?
In my case I'm making a file preprocessor that uses simple code snippets inside comments that are ignored by the underlying language, but instead evaluated by the preprocessor. Since from the perspective of the processor the underlying language is just text, I parse the content in two passes:
I would like to use ANTLR with this separation, so the parser for the first pass can be swapped for different file formats, allowing the second pass parser to work on the same structure.
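As a rough picture of that separation (plain Python rather than ANTLR, with every name hypothetical): each file format gets its own first pass, all first passes return the same segment structure, and the second pass never looks at the format. The first pass below is deliberately naive (a regex with no string awareness, comment segments carry only the body), and `second_pass` merely collects trimmed comment bodies as a stand-in for evaluating snippets.

```python
import re
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, str]  # (kind, text); kind is "content" or "comment"
FirstPass = Callable[[str], List[Segment]]


def delimiter_splitter(opener: str, closer: str) -> FirstPass:
    """Build a naive first pass for one comment syntax (no string awareness)."""
    pattern = re.compile(re.escape(opener) + r"(.*?)" + re.escape(closer), re.DOTALL)

    def first_pass(source: str) -> List[Segment]:
        segments: List[Segment] = []
        pos = 0
        for m in pattern.finditer(source):
            if m.start() > pos:
                segments.append(("content", source[pos:m.start()]))
            segments.append(("comment", m.group(1)))  # body only, delimiters dropped
            pos = m.end()
        if pos < len(source):
            segments.append(("content", source[pos:]))
        return segments

    return first_pass


# One first pass per file format; the keys are illustrative.
FIRST_PASSES: Dict[str, FirstPass] = {
    "java": delimiter_splitter("/*", "*/"),
    "html": delimiter_splitter("<!--", "-->"),
}


def second_pass(segments: List[Segment]) -> List[str]:
    """Format-independent second pass: collect the trimmed comment bodies."""
    return [text.strip() for kind, text in segments if kind == "comment"]


java_snippets = second_pass(FIRST_PASSES["java"]("int x; /* include a */ int y;"))
html_snippets = second_pass(FIRST_PASSES["html"]("<p><!-- include a --></p>"))
```

Swapping the regex-based first pass for an ANTLR-generated lexer per format would leave `second_pass` unchanged, as long as the segment structure stays the same.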