Commit 8f1ae5c

Update grammar format versioning rules et al. (#471)
* Refine grammar format versioning rules; a too new minor version now makes the grammar unparsable.
* Clean-up grammar format spec.
* Implement new versioning rules.
* Add tests for `Grammar.IsUnparsable` and `HasUnknownData`.
* Update release notes.
* Do not emit group end symbols in the main DFA. This de facto makes the custom DFA group start states mandatory for parsers to understand. The format spec already requires it.
* Add specification for DFA state machines on bytes. We won't support them in the library right now, but if they are added in the future, they won't be affected by the `Critical` flag.
1 parent a8058cb commit 8f1ae5c

12 files changed, +157 -18 lines changed

RELEASE_NOTES.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ The following changes were made after 7.0.0-preview.1:
44
* __Breaking change:__ Removed the `IParserStateBox` interface and related APIs.
55
* Added `BuildSemanticProvider` extension method to `IGrammarBuilder<T>`.
66
* Added structures to the grammar file format that can make parsing groups two times faster or more. (https://github.com/teo-tsirpanis/Farkle/issues/153)
7+
* Updated the grammar file format specification to improve its version compatibility rules, and add specification for potential future features.
78
* Fixed referencing the Farkle package in F# Interactive.
89
* Fixed failures when rendering templates from very large grammars.
910
* Fixed a bug where the `IsSingleTokenizerInChain` extension method would return wrong results.

designs/7.0/grammar-file-format-spec.md

Lines changed: 38 additions & 9 deletions
@@ -2,12 +2,23 @@
22

33
This document describes the binary format of Farkle 7's grammars. It is heavily inspired by the Common Language Infrastructure metadata format described in [ECMA-335][ecma].
44

5+
The current version of the format is __7.0__.
6+
57
## Ground rules
68

79
* The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
10+
* Non-normative text is written in blockquotes.
811
* Numbers are stored in little-endian format.
912
* The total size of a grammar file MUST NOT exceed 2<sup>31</sup> - 1 bytes.
1013

14+
## Implementation requirements
15+
16+
Implementations of this specification are divided into three categories:
17+
18+
* _Writers_ create grammar files. They MUST implement the subset of the specification that is necessary to create valid grammar files according to the specification and their use case.
19+
* _Readers_ read grammar files for the purpose of displaying their content to users. They MAY ignore parts of the specification that dictate the behavior of parsers, or data structure specifications that they don't need to display.
20+
* _Parsers_ are a subset of readers, and read grammar files for the purpose of parsing input according to the grammar's specification, either directly, or by generating code that performs the parsing. They are REQUIRED to implement the entire specification, unless specified otherwise. They MAY forgo implementing parts of the specification that they don't need, as long as they report an error when encountering data that would require such parts to be implemented.
21+
1122
## Abstract file structure
1223

1324
From a top-level perspective, a grammar file is made of a _header_ and a sequence of _streams_.
@@ -25,10 +36,10 @@ A grammar file starts with the following data:
2536

2637
* If the first eight bytes of the file are different than the __Magic__ field's expected value, readers MUST NOT read past them and MUST report an error.
2738
> Note that the magic code ends with two zeroes. It will prevent [GOLD Parser][gold] grammar file readers from continuing reading it.
28-
* If the value of the __MajorVersion__ field is larger than the expected one or smaller than `7`, readers MUST NOT read past the __MinorVersion__ field and MUST report an error.
29-
* When and if a file version after 7 gets specified, if the version of the file is smaller than the latest the reader supports, it MAY keep compatibility with the older format, or report an error.
39+
* If the value of the __MajorVersion__ field is outside the range of the major versions the reader supports, readers MUST NOT read past the __MinorVersion__ field and MUST report an error.
40+
* If the value of the __MajorVersion__ field is equal to the latest supported major version, and the value of the __MinorVersion__ field is larger than the latest supported minor version, the grammar file MUST NOT be used for parsing.
3041

31-
A different than expected __MajorVersion__ field indicates that the file cannot be read at all. A different than expected __MinorVersion__ field indicates that the file can be read, but might be incorrectly interpreted.
42+
> A different than expected __MajorVersion__ field indicates that the file cannot be read at all. A different than expected __MinorVersion__ field indicates that the file can be read, but might be incorrectly interpreted.
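The version rules above can be sketched as a small classification function. This is only an illustration under stated assumptions: the type and member names are hypothetical and not part of Farkle's API, and the reader is assumed to support exactly major version 7 up to minor version 0 (the current format version). The minor-version comparison mirrors the check this commit adds to `Grammar.IsUnparsable`.

```csharp
using System;

// Hypothetical names throughout; this is not Farkle's API.
public enum FormatCompatibility
{
    Unreadable,   // MajorVersion is outside the supported range: report an error.
    ReadableOnly, // Too new MinorVersion: may be displayed, must not be used for parsing.
    Parsable
}

public static class FormatVersionCheck
{
    // Assumption: this reader supports only major version 7, up to minor version 0.
    public const ushort OldestSupportedMajor = 7;
    public const ushort LatestSupportedMajor = 7;
    public const ushort LatestSupportedMinor = 0;

    public static FormatCompatibility Classify(ushort major, ushort minor)
    {
        // A major version outside the supported range cannot be read at all.
        if (major < OldestSupportedMajor || major > LatestSupportedMajor)
        {
            return FormatCompatibility.Unreadable;
        }
        // Same major version but newer minor version: readable, but not parsable.
        if (major == LatestSupportedMajor && minor > LatestSupportedMinor)
        {
            return FormatCompatibility.ReadableOnly;
        }
        return FormatCompatibility.Parsable;
    }
}
```

Under these assumptions, `FormatVersionCheck.Classify(7, 1)` yields `ReadableOnly`, while `Classify(8, 0)` yields `Unreadable`.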
3243
3344
### Stream Definition
3445

@@ -208,8 +219,8 @@ The following bit values are defined for the __Flags__ column:
208219
|Bit|Name|Description|
209220
|---|----|-----------|
210221
|0|`EndsOnEndOfInput`|The group can also end when the end of the input is reached, without encountering the token symbol specified in the __End__ column.|
211-
|1|`AdvanceByCharacter`|When inside this group, the parser should read the input without invoking the regular tokenizer.|
212-
|2|`KeepEndToken`|When the group ends, the parser should keep the token that ended the group in the input stream.|
222+
|1|`AdvanceByCharacter`|When inside this group, the parser must read the input without invoking the regular tokenizer.|
223+
|2|`KeepEndToken`|When the group ends, the parser must keep the token that ended the group in the input stream.|
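As a rough illustration of the bit assignments in the table above, the flags could be modeled as a C# `[Flags]` enum. The names `GroupFlagsSketch` and `GroupFlagHelpers` are hypothetical stand-ins, not the library's own `GroupAttributes` type, and the underlying integer type is an assumption.

```csharp
using System;

// Hypothetical stand-in for illustration; bit positions follow the table above.
[Flags]
public enum GroupFlagsSketch
{
    None = 0,
    EndsOnEndOfInput = 1 << 0,   // bit 0
    AdvanceByCharacter = 1 << 1, // bit 1
    KeepEndToken = 1 << 2        // bit 2
}

public static class GroupFlagHelpers
{
    // The regular tokenizer is invoked inside a group only when
    // AdvanceByCharacter is not set.
    public static bool UsesRegularTokenizer(GroupFlagsSketch flags) =>
        (flags & GroupFlagsSketch.AdvanceByCharacter) == 0;

    // When KeepEndToken is set, the token that ended the group
    // stays in the input stream.
    public static bool KeepsEndToken(GroupFlagsSketch flags) =>
        (flags & GroupFlagsSketch.KeepEndToken) != 0;
}
```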
213224

214225
The following rules apply to the _Group_ table:
215226

@@ -315,7 +326,7 @@ The following rules apply to the _Production_ table:
315326
* The value of the __FirstMember__ column of the first production MUST be equal to one.
316327
* If a production and all productions after it do not have any members, its __FirstMember__ column MUST be equal to the number of rows in the _ProductionMember_ table plus one.
317328

318-
> Before accessing the members of a production, readers MUST ensure that it actually has members.
329+
> Before accessing the members of a production, readers MUST ensure that the production actually has members.
319330
320331
### _ProductionMember_ table
321332

@@ -343,15 +354,20 @@ The following values are defined for the __Kind__ column:
343354
|3|LR(1) state machine.|
344355
|4|Generalized LR(1) (GLR(1)) state machine.|
345356
|5|Deterministic Finite Automaton (DFA) group start states on 16-bit character ranges.|
357+
|6|Deterministic Finite Automaton (DFA) on 8-bit character ranges.|
358+
|7|Deterministic Finite Automaton (DFA) on 8-bit character ranges with conflicts.|
359+
|8|Deterministic Finite Automaton (DFA) default transitions on 8-bit character ranges.|
360+
|9|Deterministic Finite Automaton (DFA) group start states on 8-bit character ranges.|
346361
|_anything else_|Reserved for future use by the Farkle project.|
347362

348363
> Instead of GLR(1) we could have called it "LR(1) state machine with conflicts" for symmetry, but this kind of state machine has an established name. Currently there are no plans to support GLR parsing in the Farkle project.
349364
350365
The following rules apply to the _StateMachine_ table:
351366

352367
* The __Kind__ column MUST NOT contain duplicate values.
353-
* If both state machines of __Kind__ 0 and 1, or 3 and 4 exist, they MUST describe the same state machine, with their only difference being in the preferred values in case of conflicts.
354-
* If a state machine of __Kind__ 2 exists, a state machine of __Kind__ 0 or 1 MUST also exist.
368+
* If both state machines of __Kind__ 0 and 1, 3 and 4, or 6 and 7 exist, they MUST describe the same state machine, with their only difference being in the preferred values in case of conflicts.
369+
* If state machines of __Kind__ 2 or 5 exist, a state machine of __Kind__ 0 or 1 MUST also exist.
370+
* If state machines of __Kind__ 8 or 9 exist, a state machine of __Kind__ 6 or 7 MUST also exist.
355371

356372
State machines with no states MUST be treated as if they do not exist.
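The structural rules for the _StateMachine_ table can be checked with a short validation routine. The sketch below is only an illustration: `StateMachineKindRules` and `Validate` are hypothetical names, the input is assumed to be the __Kind__ values of the state machines that have at least one state (empty ones are treated as absent, per the rule above), and the requirement that paired kinds describe the same state machine is not checked.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical validator; not part of Farkle's API.
public static class StateMachineKindRules
{
    // 'kinds' holds the Kind column values of all non-empty state machines.
    public static List<string> Validate(IReadOnlyList<uint> kinds)
    {
        var errors = new List<string>();
        var seen = new HashSet<uint>();
        foreach (uint kind in kinds)
        {
            // The Kind column MUST NOT contain duplicate values.
            if (!seen.Add(kind))
            {
                errors.Add($"Duplicate state machine kind {kind}.");
            }
        }

        bool hasCharDfa = seen.Contains(0u) || seen.Contains(1u);
        bool hasByteDfa = seen.Contains(6u) || seen.Contains(7u);

        // Kinds 2 and 5 accompany a DFA on 16-bit characters (kind 0 or 1).
        if ((seen.Contains(2u) || seen.Contains(5u)) && !hasCharDfa)
        {
            errors.Add("Kind 2 or 5 requires a state machine of kind 0 or 1.");
        }
        // Kinds 8 and 9 accompany a DFA on 8-bit characters (kind 6 or 7).
        if ((seen.Contains(8u) || seen.Contains(9u)) && !hasByteDfa)
        {
            errors.Add("Kind 8 or 9 requires a state machine of kind 6 or 7.");
        }
        return errors;
    }
}
```

Returning a list of messages instead of throwing on the first violation is a design choice of this sketch, so that a reader can report every problem at once.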
357373

@@ -369,7 +385,7 @@ The following rules apply to the _SpecialName_ table:
369385
* The __Symbol__ column MUST NOT contain duplicate values.
370386
* The __Name__ column MUST NOT contain duplicate values.
371387

372-
> The use case for this table is to help custom code that integrates with parsers such as tokenizers. Since in Farkle symbols can be renamed and many can have the same name within a grammar, the special name provides a stable way to identify them (it sticks to the symbol's original name and duplicate names would cause build failures).
388+
> The use case for this table is to help custom code that integrates with parsers such as tokenizers. Since many symbols can have the same name within a grammar, the special name provides a guaranteed unique way to identify them.
373389
374390
## State machines
375391

@@ -512,6 +528,19 @@ Readers SHOULD expose an API that indicates whether a grammar file contains data
512528
* A state machine whose __Kind__ is not known to the reader.
513529
* A table whose kind is not specified in this specification.
514530

531+
### Summary of extensibility mechanisms
532+
533+
The following table summarizes the extensibility mechanisms provided by the format, and whether third parties can use them:
534+
535+
|I want to|How to achieve it|Available to third parties|
536+
|-|-|-|
537+
|Extend the format with data that does not affect parsing behavior and can be safely ignored|Add a custom stream|Yes|
538+
|Extend the format with data that might affect parsing behavior|Add a custom state machine|Yes|
539+
|Extend the format with data that might affect parsing behavior|Add a custom table|No|
540+
|Instruct parsers to refuse parsing if they do not understand your format extensions|Set the `Critical` flag in the _Grammar_ table|Yes|
541+
|Instruct parsers to refuse parsing if they do not understand your format extensions, while keeping the `Critical` flag available for third parties to use|Increase the __MinorVersion__ field in the file header|No|
542+
|Substantially change the format in a way incompatible for both readers and parsers|Increase the __MajorVersion__ field in the file header|No|
543+
515544
[ecma]: https://www.ecma-international.org/publications-and-standards/standards/ecma-335/
516545
[rfc2119]: https://www.rfc-editor.org/rfc/rfc2119
517546
[gold]: http://goldparser.org

src/Farkle/Builder/GrammarBuild.cs

Lines changed: 26 additions & 9 deletions
@@ -2,6 +2,7 @@
22
// SPDX-License-Identifier: MIT
33

44
using System.Collections.Immutable;
5+
using System.Diagnostics;
56
using System.Numerics;
67
using Farkle.Builder.Dfa;
78
using Farkle.Builder.Lr;
@@ -50,6 +51,10 @@ private static string ExtractFirstPossibleCharacters(Regex regex)
5051
// Support only the subset of regexes the builder currently supports to start and end groups.
5152
// If we want to generalize groups in the future and have them be bounded by arbitrary regexes,
5253
// we might need to somehow run this inside the DFA builder.
54+
while (regex.IsAccept(out Regex? r, out _, out _))
55+
{
56+
regex = r;
57+
}
5358
if (regex == NewLineRegex)
5459
{
5560
return "\n\r";
@@ -67,9 +72,10 @@ private static string ExtractFirstPossibleCharacters(Regex regex)
6772
/// for when the tokenizer is inside a group. If a custom DFA cannot be used,
6873
/// this function will return <see langword="null"/>.
6974
/// </summary>
70-
private static Regex? GetGroupRegex(string start, Regex endRegex, TokenSymbolHandle endSymbol, bool isRecursive,
71-
GroupAttributes groupAttributes)
75+
private static Regex? GetGroupRegex(string start, Regex endRegexWithAccept, bool isRecursive,
76+
GroupAttributes groupAttributes, out bool addEndRegexToMainDfa)
7277
{
78+
addEndRegexToMainDfa = true;
7379
if ((groupAttributes & GroupAttributes.AdvanceByCharacter) == 0)
7480
{
7581
// Token groups cannot use a custom DFA starting state by definition.
@@ -93,15 +99,16 @@ private static string ExtractFirstPossibleCharacters(Regex regex)
9399
bool keepEndToken = (groupAttributes & GroupAttributes.KeepEndToken) != 0;
94100
if (keepEndToken)
95101
{
96-
prohibitedCharacters = prohibitedCharacters.AddRange(ExtractFirstPossibleCharacters(endRegex));
102+
prohibitedCharacters = prohibitedCharacters.AddRange(ExtractFirstPossibleCharacters(endRegexWithAccept));
97103
}
98104
// Set HighPriorityInverted because, if a recursive group starts and ends with the same character,
99105
// we must fail and leave it to the main DFA to determine which of the two (or none) happened.
100106
// Set BreakOnAccept, in order to stop reading random text when the group end gets matched.
101107
Regex result = Regex.Chars(prohibitedCharacters, Regex.CharsFlags.HighPriorityInverted | Regex.CharsFlags.BreakOnAccept).ZeroOrMore();
102108
if (!keepEndToken)
103109
{
104-
result += Regex.Accept(endRegex, endSymbol, lowestPriority: false);
110+
result += endRegexWithAccept;
111+
addEndRegexToMainDfa = false;
105112
}
106113
return result;
107114
}
@@ -352,15 +359,27 @@ void HandleGroup(string name, string start, string? endOrNewLine, GroupOptions o
352359
}
353360
else
354361
{
355-
endHandle = GetOrCreateGroupEndLiteral(endOrNewLine, out endRegex);
362+
endRegex = GetRegexForLiteral(endOrNewLine);
363+
endHandle = GetOrCreateGroupEndLiteral(endOrNewLine);
356364
}
357365
bool isRecursive = (options & GroupOptions.Recursive) != 0;
358366
GroupHandle groupHandle = writer.AddGroup(writer.GetOrAddString(name), container, flags, startHandle, endHandle, isRecursive ? 1 : 0);
359367
if (isRecursive)
360368
{
361369
writer.AddGroupNesting(groupHandle);
362370
}
363-
groupDfaRegexes?.Add(GetGroupRegex(start, endRegex, endHandle, isRecursive, flags));
371+
if (regexBuilder is not null)
372+
{
373+
Regex endRegexWithAccept = Regex.Accept(endRegex, endHandle, lowestPriority: false);
374+
bool addEndRegexToMainDfa = true;
375+
groupDfaRegexes?.Add(GetGroupRegex(start, endRegexWithAccept, isRecursive, flags, out addEndRegexToMainDfa));
376+
if (addEndRegexToMainDfa)
377+
{
378+
// The regex might be added multiple times, but it's OK since all accept
379+
// the same symbol, and won't cause any conflicts.
380+
regexBuilder.Add(endRegexWithAccept);
381+
}
382+
}
364383
}
365384

366385
// Gets the handle to a group end literal symbol, creating it if it does not exist.
@@ -373,16 +392,14 @@ void HandleGroup(string name, string start, string? endOrNewLine, GroupOptions o
373392
// bookkeeping ourselves. By storing the strings inside the general symbol map, we
374393
// also avoid conflicts between group end symbols and literals, which was not possible
375394
// before.
376-
TokenSymbolHandle GetOrCreateGroupEndLiteral(string content, out Regex regex)
395+
TokenSymbolHandle GetOrCreateGroupEndLiteral(string content)
377396
{
378-
regex = GetRegexForLiteral(content);
379397
if (symbolMap.TryGetValue(content, out EntityHandle existingHandle))
380398
{
381399
return (TokenSymbolHandle)existingHandle;
382400
}
383401
TokenSymbolHandle handle = writer.AddTokenSymbol(writer.GetOrAddString(content), TokenSymbolAttributes.None);
384402
dfaSymbols?.Add(handle, content, TokenSymbolKind.GroupEnd);
385-
regexBuilder?.Add(Regex.Accept(regex, handle, lowestPriority: false));
386403
symbolMap.Add(content, handle);
387404
return handle;
388405
}

src/Farkle/Grammars/Grammar.cs

Lines changed: 17 additions & 0 deletions
@@ -23,6 +23,16 @@ public abstract partial class Grammar : IGrammarProvider
2323
internal readonly BlobHeap BlobHeap;
2424
internal readonly GrammarTables GrammarTables;
2525

26+
/// <summary>
27+
/// The grammar file's format major version.
28+
/// </summary>
29+
public ushort FormatVersionMajor { get; }
30+
31+
/// <summary>
32+
/// The grammar file's format minor version.
33+
/// </summary>
34+
public ushort FormatVersionMinor { get; }
35+
2636
/// <summary>
2737
/// A read-only buffer to the <see cref="Grammar"/>'s binary data.
2838
/// </summary>
@@ -112,6 +122,8 @@ private protected Grammar(ReadOnlySpan<byte> grammarFile)
112122
{
113123
GrammarHeader header = GrammarHeader.Read(grammarFile);
114124
ValidateHeader(header);
125+
FormatVersionMajor = header.VersionMajor;
126+
FormatVersionMinor = header.VersionMinor;
115127

116128
GrammarStreams streams = new(grammarFile, header.StreamCount);
117129

@@ -433,6 +445,11 @@ internal bool IsUnparsable([NotNullWhen(true)] out string? errorResourceKey)
433445
errorResourceKey = nameof(Resources.Parser_UnparsableGrammar);
434446
return true;
435447
}
448+
if (FormatVersionMajor == GrammarConstants.VersionMajor && FormatVersionMinor > GrammarConstants.VersionMinor)
449+
{
450+
errorResourceKey = nameof(Resources.Parser_UnparsableGrammar_TooNewFormat);
451+
return true;
452+
}
436453
if (HasUnknownData && (flags & GrammarAttributes.Critical) != 0)
437454
{
438455
errorResourceKey = nameof(Resources.Parser_UnparsableGrammar_Critical);

src/Farkle/Grammars/GrammarConstants.cs

Lines changed: 5 additions & 0 deletions
@@ -22,6 +22,11 @@ internal static class GrammarConstants
2222
public const uint DfaOnCharDefaultTransitionsKind = 2;
2323
public const uint DfaOnCharGroupStartStatesKind = 5;
2424

25+
public const uint DfaOnByteKind = 6;
26+
public const uint DfaOnByteWithConflictsKind = 7;
27+
public const uint DfaOnByteDefaultTransitionsKind = 8;
28+
public const uint DfaOnByteGroupStartStatesKind = 9;
29+
2530
public const uint Lr1Kind = 3;
2631
public const uint Glr1Kind = 4;
2732

src/Farkle/Grammars/GrammarStateMachines.cs

Lines changed: 6 additions & 0 deletions
@@ -33,6 +33,12 @@ public GrammarStateMachines(ReadOnlySpan<byte> grammarFile, in BlobHeap blobHeap
3333
case GrammarConstants.DfaOnCharGroupStartStatesKind:
3434
AssignStateMachine(grammarFile, in blobHeap, kind, data, ref DfaOnChar.GroupStartStates, ref seenDfaOnCharGroupStartStates);
3535
break;
36+
case GrammarConstants.DfaOnByteKind:
37+
case GrammarConstants.DfaOnByteWithConflictsKind:
38+
case GrammarConstants.DfaOnByteDefaultTransitionsKind:
39+
case GrammarConstants.DfaOnByteGroupStartStatesKind:
40+
// We don't currently support DFAs on bytes; ignore them but don't set HasUnknownData.
41+
break;
3642
case GrammarConstants.Lr1Kind:
3743
AssignStateMachine(grammarFile, in blobHeap, kind, data, ref Lr1, ref seenLr1);
3844
break;
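The effect of the new `case` labels can be illustrated with a standalone sketch that separates three outcomes for a state machine kind: handled, recognized but ignored, and unknown. The names below are hypothetical, and treating every kind from 0 to 5 as handled is an assumption, since the hunk shows only some of the existing cases. In the library, `HasUnknownData` may also depend on other conditions that are not modeled here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical classification; not part of Farkle's API.
public enum KindHandling
{
    Loaded,               // Assumed for kinds 0 to 5.
    RecognizedButIgnored, // Kinds 6 to 9: DFAs on bytes, specified but not supported.
    Unknown               // Anything else counts as unknown data.
}

public static class StateMachineKindSketch
{
    public static KindHandling Classify(uint kind)
    {
        if (kind <= 5) return KindHandling.Loaded;
        if (kind <= 9) return KindHandling.RecognizedButIgnored;
        return KindHandling.Unknown;
    }

    // Only kinds the reader does not recognize at all count as unknown data,
    // so a DFA on bytes alone does not trigger the Critical flag check.
    public static bool HasUnknownData(IEnumerable<uint> kinds) =>
        kinds.Any(kind => Classify(kind) == KindHandling.Unknown);
}
```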

src/Farkle/Properties/Resources.cs

Lines changed: 2 additions & 0 deletions
@@ -227,6 +227,8 @@ public static string GetEofString(IFormatProvider? formatProvider)
227227

228228
public static string Parser_UnparsableGrammar_Critical => GetResourceString(nameof(Parser_UnparsableGrammar_Critical));
229229

230+
public static string Parser_UnparsableGrammar_TooNewFormat => GetResourceString(nameof(Parser_UnparsableGrammar_TooNewFormat));
231+
230232
public static string Parser_GrammarLrMissing => GetResourceString(nameof(Parser_GrammarLrMissing));
231233

232234
public static string Parser_GrammarLrProblem => GetResourceString(nameof(Parser_GrammarLrProblem));

src/Farkle/Properties/Resources.el.resx

Lines changed: 3 additions & 0 deletions
@@ -72,6 +72,9 @@
7272
<data name="Parser_UnparsableGrammar_Critical" xml:space="preserve">
7373
<value>Η γραμματική δεν μπορεί να χρησιμοποιηθεί για συντακτική ή λεκτική ανάλυση επειδή περιέχει δεδομένα που δεν αναγνωρίζονται από αυτήν την έκδοση του Farkle</value>
7474
</data>
75+
<data name="Parser_UnparsableGrammar_TooNewFormat" xml:space="preserve">
76+
<value>Η γραμματική δεν μπορεί να χρησιμοποιηθεί για συντακτική ή λεκτική ανάλυση επειδή η έκδοση μορφής της είναι πολύ νέα</value>
77+
</data>
7578
<data name="Parser_GrammarLrMissing" xml:space="preserve">
7679
<value>Η γραμματική δεν μπορεί να χρησιμοποιηθεί για συντακτική ανάλυση επειδή δεν έχει πίνακα καταστάσεων LR</value>
7780
</data>

src/Farkle/Properties/Resources.resx

Lines changed: 3 additions & 0 deletions
@@ -72,6 +72,9 @@
7272
<data name="Parser_UnparsableGrammar_Critical" xml:space="preserve">
7373
<value>The grammar cannot be used for parsing or tokenizing because it contains data not recognized by this version of Farkle</value>
7474
</data>
75+
<data name="Parser_UnparsableGrammar_TooNewFormat" xml:space="preserve">
76+
<value>The grammar cannot be used for parsing or tokenizing because its format version is too new</value>
77+
</data>
7578
<data name="Parser_GrammarLrMissing" xml:space="preserve">
7679
<value>The grammar cannot be used for parsing because it does not contain an LR state table</value>
7780
</data>
