Separating comment blocks from a source file #4831

kikugie · 2025-05-07T19:34:50Z

My goal is to process the comment contents of a file while preserving the rest. For example, given this file:

public class Example {
    public static void main(String[] args) {
        /*Nothing yet*/
    }
}

The parser should split it in the following structure:

file:
  content: "public class Example {\n    public static void main(String[] args) {\n        "
  comment:
    opener: "/*"
    body: "Nothing yet"
    closer: "*/"
  content: "\n    }\n}"

The ANTLR documentation covers skipping comments, but in this case they must be preserved, and the concept of "match everything, but ..." does not seem straightforward (plus, of course, dealing with fake comments inside strings is a hassle).
How would I go about writing a grammar that accomplishes this?

In my case I'm making a file preprocessor that uses simple code snippets inside comments that are ignored by the underlying language, but instead evaluated by the preprocessor. Since from the perspective of the processor the underlying language is just text, I parse the content in two passes:

1. Separate comments from code content: https://codeberg.org/stonecutter/stonecutter/src/branch/0.6/stitcher/src/main/kotlin/dev/kikugie/stitcher/scanner/Scanner.kt
2. Process relevant comments individually: https://codeberg.org/stonecutter/stonecutter/src/branch/0.6/stitcher/src/main/kotlin/dev/kikugie/stitcher/parser/CommentParser.kt

I would like to use ANTLR with this separation, so the parser for the first pass can be swapped for different file formats, allowing the second pass parser to work on the same structure.

kaby76 · 2025-05-08T11:14:28Z

The parser should split it in the following structure:

file:
  content: "public class Example {\n    public static void main(String[] args) {\n        "
  comment:
    opener: "/*"
    body: "Nothing yet"
    closer: "*/"
  content: "\n    }\n}"

What you are describing is a parse tree with interleaved sub-language content (comments). The off-channel content (comment) is parsed in a separate language. You can do this with Antlr, but it will need to be done in several phases.

1. Create a grammar for the "main" language. Parse the input using the "main" language (I guess csharp). The result is a parse tree and a token stream with (off-channel) comments.
2. Create a grammar for the "sub" language. Parse each off-channel comment token and compute a tree and a sub token stream.
3. Write a tree walker that constructs a new tree that stitches the two types of parse trees together.
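
The stitching idea in the last step can be illustrated without the ANTLR runtime. The Java sketch below assumes the main-language lexer has already produced tokens, each marked as comment or non-comment. `CommentStitcher`, `Tok`, `Node`, and `stitch` are hypothetical names, with `Tok` standing in for an ANTLR token and its channel. It merges consecutive non-comment tokens into content nodes and splits each comment token into opener, body, and closer. It only handles well-formed `/* ... */` comments and does not run a sub-language parser on the body.

```java
import java.util.ArrayList;
import java.util.List;

final class CommentStitcher {
    /** Hypothetical stand-in for a lexer token: raw text plus a comment flag. */
    static final class Tok {
        final String text;
        final boolean comment; // true when the lexer routed this token off-channel as a comment
        Tok(String text, boolean comment) { this.text = text; this.comment = comment; }
    }

    /** One node of the stitched structure; opener and closer are empty for content nodes. */
    static final class Node {
        final String kind, opener, body, closer;
        Node(String kind, String opener, String body, String closer) {
            this.kind = kind; this.opener = opener; this.body = body; this.closer = closer;
        }
    }

    /** Interleaves merged content runs with split-up comment tokens, in input order. */
    static List<Node> stitch(List<Tok> tokens) {
        List<Node> out = new ArrayList<>();
        StringBuilder content = new StringBuilder();
        for (Tok t : tokens) {
            if (!t.comment) {
                content.append(t.text); // keep every non-comment token verbatim
                continue;
            }
            flush(content, out);
            // Assumption: comment text is a well-formed block comment "/*...*/".
            String s = t.text;
            out.add(new Node("comment", "/*", s.substring(2, s.length() - 2), "*/"));
        }
        flush(content, out);
        return out;
    }

    /** Emits the pending content run, if any, and resets the buffer. */
    private static void flush(StringBuilder content, List<Node> out) {
        if (content.length() > 0) {
            out.add(new Node("content", "", content.toString(), ""));
            content.setLength(0);
        }
    }
}
```

With a real ANTLR lexer, whitespace and comments would need to be routed with `-> channel(HIDDEN)` rather than `-> skip`, since skipped text never reaches the token stream and so could not be preserved.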

This problem is pretty common. An example is the postgresql grammar, where we have postgresql and meta commands mixed in input. The meta commands are not part of postgresql.

Originally, someone tried to combine the grammars of the two languages together but ended up with a huge mess. The two languages have different lexing requirements. And, the combined grammar contained many ambiguities not present in a straight postgresql grammar.

jimidle · 2025-05-08T17:50:37Z

You can likely do this with a lexer with modes. Capture strings in the default mode and push a comment mode when one of your comment introducers is found. In comment mode, return different tokens than in the default mode until the comment closes (remember that comments can be nested, so you may or may not need to recurse the lexer rule).

STRING: '"' … // capture strings
COMMENT_START: '/*' -> pushMode(commentMode) ;
C: . ;

mode commentMode;
COMMAND: 'xyz' ;
OTHER: 'other' ;
…
COMMENT_END: '*/' -> popMode ;
CHAR: . -> type(C) ;

Then your parser is trivial. There is no need to write a grammar for the commented language.
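
To make the mode switching concrete, here is a minimal hand-written Java scanner that mimics the two modes. It is not ANTLR output, and `ModeScanner`, `Segment`, and `scan` are hypothetical names. It assumes `/* ... */` comments without nesting and double-quoted strings with backslash escapes; character literals and line comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class ModeScanner {
    /** A piece of the input: kind is "content" or "comment"; text is the content or the comment body. */
    static final class Segment {
        final String kind, text;
        Segment(String kind, String text) { this.kind = kind; this.text = text; }
    }

    static List<Segment> scan(String src) {
        List<Segment> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inComment = false; // plays the role of the lexer's current mode
        int i = 0;
        while (i < src.length()) {
            if (inComment) {
                if (src.startsWith("*/", i)) {          // like COMMENT_END -> popMode
                    out.add(new Segment("comment", buf.toString()));
                    buf.setLength(0);
                    inComment = false;
                    i += 2;
                } else {
                    buf.append(src.charAt(i++));        // like CHAR: any other character
                }
            } else if (src.startsWith("/*", i)) {        // like COMMENT_START -> pushMode
                if (buf.length() > 0) out.add(new Segment("content", buf.toString()));
                buf.setLength(0);
                inComment = true;
                i += 2;
            } else if (src.charAt(i) == '"') {           // like STRING: consume the whole literal
                int end = i + 1;
                while (end < src.length() && src.charAt(end) != '"') {
                    end += (src.charAt(end) == '\\') ? 2 : 1; // step over escaped characters
                }
                end = Math.min(end + 1, src.length());   // include the closing quote if present
                buf.append(src, i, end);
                i = end;
            } else {
                buf.append(src.charAt(i++));             // like C: any other character
            }
        }
        // Sketch choice: an unterminated comment is still reported as a comment segment.
        if (buf.length() > 0) out.add(new Segment(inComment ? "comment" : "content", buf.toString()));
        return out;
    }
}
```

Because the string branch consumes a whole literal before the scanner looks for `/*` again, a comment opener inside a string stays part of the content, which is the same job the STRING rule does in the default lexer mode.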
