-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Auto-atomic for more loops followed by boundaries #117892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Today we will make a loop like `\w+\b` or `\d+\b` atomic, because the only thing the `\b` can match after that point is a non-word character, and that means it can't give back a word character or digit to satisfy the loop. But we can extend that further, since we can use the same logic to make such a loop atomic as long as the only things it can match are any subset of word characters. So, for example `[a-f0-9]+\b`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extends the automatic atomic grouping optimization for regular expressions by allowing loops followed by word boundaries to be made atomic when the loop matches only word character subsets. Previously, only specific character classes like \w+\b and \d+\b could be optimized, but now character classes like [a-f0-9]+\b and [\p{L}\d]+\b can also be made atomic.
Key changes:
- Replaces hardcoded character class checks with a more general method to detect word character subsets
- Adds comprehensive logic to determine if a character class is a subset of word characters
- Expands test coverage for the new optimization cases
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| RegexCharClass.cs | Adds IsKnownWordClassSubset method to determine if character classes are subsets of word characters |
| RegexNode.cs | Replaces hardcoded checks with the new IsKnownWordClassSubset method |
| RegexReductionTests.cs | Adds comprehensive test cases for the expanded atomic optimization |
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Show resolved
Hide resolved
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Show resolved
Hide resolved
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
|
@MihuBot regexdiff |
|
16 out of 18857 patterns have generated source code changes. Examples of GeneratedRegex source diffs"\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9] ..." (293 uses)[GeneratedRegex("\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b", RegexOptions.CultureInvariant)] /// ○ Match a character in the set [0-9].<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a character in the set [01] greedily, optionally.<br/>
- /// ○ Match a character in the set [0-9] greedily at least 1 and at most 2 times.<br/>
+ /// ○ Match a character in the set [0-9] atomically at least 1 and at most 2 times.<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
int alternation_starting_pos1 = 0;
int charloop_starting_pos = 0, charloop_ending_pos = 0;
int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
- int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
int loop_iteration = 0;
int stackpos = 0;
int startingStackpos = 0;
CharLoopEnd1:
//}
- // Match a character in the set [0-9] greedily at least 1 and at most 2 times.
- //{
- charloop_starting_pos2 = pos;
-
+ // Match a character in the set [0-9] atomically at least 1 and at most 2 times.
+ {
int iteration1 = 0;
while (iteration1 < 2 && (uint)iteration1 < (uint)slice.Length && char.IsAsciiDigit(slice[iteration1]))
{
slice = slice.Slice(iteration1);
pos += iteration1;
-
- charloop_ending_pos2 = pos;
- charloop_starting_pos2++;
- goto CharLoopEnd2;
-
- CharLoopBacktrack2:
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos2 >= charloop_ending_pos2)
- {
- goto CharLoopBacktrack1;
- }
- pos = --charloop_ending_pos2;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd2:
- //}
+ }
alternation_branch = 1;
goto AlternationMatch1;
case 0:
goto AlternationBranch1;
case 1:
- goto CharLoopBacktrack2;
+ goto CharLoopBacktrack1;
}
AlternationMatch1:;"\\b[\\w\\d\\.\\-]+\\@[\\w\\d\\.\\-]+\\.[a-z] ..." (267 uses)[GeneratedRegex("\\b[\\w\\d\\.\\-]+\\@[\\w\\d\\.\\-]+\\.[a-z]{2,6}\\b")] /// ○ Match '@'.<br/>
/// ○ Match a character in the set [-.\w\d] greedily at least once.<br/>
/// ○ Match '.'.<br/>
- /// ○ Match a character in the set [a-z] greedily at least 2 and at most 6 times.<br/>
+ /// ○ Match a character in the set [a-z] atomically at least 2 and at most 6 times.<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
int matchStart = pos;
char ch;
int charloop_starting_pos = 0, charloop_ending_pos = 0;
- int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match if at a word boundary.
goto CharLoopBacktrack;
}
- // Match a character in the set [a-z] greedily at least 2 and at most 6 times.
- //{
+ // Match a character in the set [a-z] atomically at least 2 and at most 6 times.
+ {
pos++;
slice = inputSpan.Slice(pos);
- charloop_starting_pos1 = pos;
-
int iteration2 = 0;
while (iteration2 < 6 && (uint)iteration2 < (uint)slice.Length && char.IsAsciiLetterLower(slice[iteration2]))
{
slice = slice.Slice(iteration2);
pos += iteration2;
-
- charloop_ending_pos1 = pos;
- charloop_starting_pos1 += 2;
- goto CharLoopEnd1;
-
- CharLoopBacktrack1:
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos1 >= charloop_ending_pos1)
- {
- goto CharLoopBacktrack;
- }
- pos = --charloop_ending_pos1;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd1:
- //}
+ }
// Match if at a word boundary.
if (!Utilities.IsBoundary(inputSpan, pos))
{
- goto CharLoopBacktrack1;
+ goto CharLoopBacktrack;
}
// The input matched."\\A\\b[0-9a-f]+\\b\\Z" (178 uses)[GeneratedRegex("\\A\\b[0-9a-f]+\\b\\Z")] /// <code>
/// ○ Match if at the beginning of the string.<br/>
/// ○ Match if at a word boundary.<br/>
- /// ○ Match a character in the set [0-9a-f] greedily at least once.<br/>
+ /// ○ Match a character in the set [0-9a-f] atomically at least once.<br/>
/// ○ Match if at a word boundary.<br/>
/// ○ Match if at the end of the string or if before an ending newline.<br/>
/// </code>
{
int pos = base.runtextpos;
int matchStart = pos;
- int charloop_starting_pos = 0, charloop_ending_pos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match if at the beginning of the string.
return false; // The input didn't match.
}
- // Match a character in the set [0-9a-f] greedily at least once.
- //{
- charloop_starting_pos = pos;
-
+ // Match a character in the set [0-9a-f] atomically at least once.
+ {
int iteration = slice.IndexOfAnyExcept(Utilities.s_asciiHexDigitsLower);
if (iteration < 0)
{
slice = slice.Slice(iteration);
pos += iteration;
-
- charloop_ending_pos = pos;
- charloop_starting_pos++;
- goto CharLoopEnd;
-
- CharLoopBacktrack:
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos >= charloop_ending_pos)
- {
- return false; // The input didn't match.
- }
- pos = --charloop_ending_pos;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd:
- //}
+ }
// Match if at a word boundary.
if (!Utilities.IsBoundary(inputSpan, pos))
{
- goto CharLoopBacktrack;
+ return false; // The input didn't match.
}
// Match if at the end of the string or if before an ending newline.
if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
{
- goto CharLoopBacktrack;
+ return false; // The input didn't match.
}
// The input matched."\\A\\p{Lu}{2,}\\b" (117 uses)[GeneratedRegex("\\A\\p{Lu}{2,}\\b")] /// Explanation:<br/>
/// <code>
/// ○ Match if at the beginning of the string.<br/>
- /// ○ Match a character in the set [\p{Lu}] greedily at least twice.<br/>
+ /// ○ Match a character in the set [\p{Lu}] atomically at least twice.<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
{
int pos = base.runtextpos;
int matchStart = pos;
- int charloop_starting_pos = 0, charloop_ending_pos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match if at the beginning of the string.
return false; // The input didn't match.
}
- // Match a character in the set [\p{Lu}] greedily at least twice.
- //{
- charloop_starting_pos = pos;
-
+ // Match a character in the set [\p{Lu}] atomically at least twice.
+ {
int iteration = 0;
while ((uint)iteration < (uint)slice.Length && char.IsUpper(slice[iteration]))
{
slice = slice.Slice(iteration);
pos += iteration;
-
- charloop_ending_pos = pos;
- charloop_starting_pos += 2;
- goto CharLoopEnd;
-
- CharLoopBacktrack:
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos >= charloop_ending_pos)
- {
- return false; // The input didn't match.
- }
- pos = --charloop_ending_pos;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd:
- //}
+ }
// Match if at a word boundary.
if (!Utilities.IsBoundary(inputSpan, pos))
{
- goto CharLoopBacktrack;
+ return false; // The input didn't match.
}
// The input matched."(\\d+)|(\\b([MDCLXVI]+)\\b)" (112 uses)[GeneratedRegex("(\\d+)|(\\b([MDCLXVI]+)\\b)", RegexOptions.IgnoreCase)] /// ○ 2nd capture group.<br/>
/// ○ Match if at a word boundary.<br/>
/// ○ 3rd capture group.<br/>
- /// ○ Match a character in the set [CDILMVXcdilmvx] greedily at least once.<br/>
+ /// ○ Match a character in the set [CDILMVXcdilmvx] atomically at least once.<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
int capture_starting_pos = 0;
int capture_starting_pos1 = 0;
int capture_starting_pos2 = 0;
- int charloop_capture_pos = 0;
- int charloop_starting_pos = 0, charloop_ending_pos = 0;
- int stackpos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
- // Atomic group.
+ // Match with 2 alternative expressions, atomically.
{
- int atomic_stackpos = stackpos;
+ int alternation_starting_pos = pos;
+ int alternation_starting_capturepos = base.Crawlpos();
- // Match with 2 alternative expressions, atomically.
- //{
- int alternation_starting_pos = pos;
- int alternation_starting_capturepos = base.Crawlpos();
-
- // Branch 0
+ // Branch 0
+ {
+ // 1st capture group.
{
- // 1st capture group.
+ capture_starting_pos = pos;
+
+ // Match a Unicode digit atomically at least once.
{
- capture_starting_pos = pos;
-
- // Match a Unicode digit atomically at least once.
+ int iteration = 0;
+ while ((uint)iteration < (uint)slice.Length && char.IsDigit(slice[iteration]))
{
- int iteration = 0;
- while ((uint)iteration < (uint)slice.Length && char.IsDigit(slice[iteration]))
- {
- iteration++;
- }
-
- if (iteration == 0)
- {
- goto AlternationBranch;
- }
-
- slice = slice.Slice(iteration);
- pos += iteration;
+ iteration++;
}
- base.Capture(1, capture_starting_pos, pos);
+ if (iteration == 0)
+ {
+ goto AlternationBranch;
+ }
+
+ slice = slice.Slice(iteration);
+ pos += iteration;
}
- goto AlternationMatch;
-
- AlternationBranch:
- pos = alternation_starting_pos;
- slice = inputSpan.Slice(pos);
- UncaptureUntil(alternation_starting_capturepos);
+ base.Capture(1, capture_starting_pos, pos);
}
- // Branch 1
- {
- // 2nd capture group.
- //{
- capture_starting_pos1 = pos;
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
- // 3rd capture group.
- //{
- capture_starting_pos2 = pos;
-
- // Match a character in the set [CDILMVXcdilmvx] greedily at least once.
- //{
- charloop_starting_pos = pos;
-
- int iteration1 = slice.IndexOfAnyExcept(Utilities.s_ascii_1832400118324001);
- if (iteration1 < 0)
- {
- iteration1 = slice.Length;
- }
-
- if (iteration1 == 0)
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
- slice = slice.Slice(iteration1);
- pos += iteration1;
-
- charloop_ending_pos = pos;
- charloop_starting_pos++;
- goto CharLoopEnd;
-
- CharLoopBacktrack:
- UncaptureUntil(charloop_capture_pos);
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos >= charloop_ending_pos)
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
- pos = --charloop_ending_pos;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd:
- charloop_capture_pos = base.Crawlpos();
- //}
-
- base.Capture(3, capture_starting_pos2, pos);
-
- goto CaptureSkipBacktrack;
-
- CaptureBacktrack:
- goto CharLoopBacktrack;
-
- CaptureSkipBacktrack:;
- //}
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack;
- }
-
- base.Capture(2, capture_starting_pos1, pos);
- //}
-
- }
+ goto AlternationMatch;
- AlternationMatch:;
- //}
+ AlternationBranch:
+ pos = alternation_starting_pos;
+ slice = inputSpan.Slice(pos);
+ UncaptureUntil(alternation_starting_capturepos);
+ }
- stackpos = atomic_stackpos;
+ // Branch 1
+ {
+ // 2nd capture group.
+ {
+ capture_starting_pos1 = pos;
+
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
+ {
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+
+ // 3rd capture group.
+ {
+ capture_starting_pos2 = pos;
+
+ // Match a character in the set [CDILMVXcdilmvx] atomically at least once.
+ {
+ int iteration1 = slice.IndexOfAnyExcept(Utilities.s_ascii_1832400118324001);
+ if (iteration1 < 0)
+ {
+ iteration1 = slice.Length;
+ }
+
+ if (iteration1 == 0)
+ {
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+
+ slice = slice.Slice(iteration1);
+ pos += iteration1;
+ }
+
+ base.Capture(3, capture_starting_pos2, pos);
+ }
+
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
+ {
+ UncaptureUntil(0);
+ return false; // The input didn't match.
+ }
+
+ base.Capture(2, capture_starting_pos1, pos);
+ }
+
+ }
+
+ AlternationMatch:;
}
// The input matched."(\\b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9] ..." (85 uses)[GeneratedRegex("(\\b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9])\\.){3}(?:(?:2([0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9]))\\b)")] /// ○ Match a character in the set [0-5].<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a character in the set [01] greedily, optionally.<br/>
- /// ○ Match a character in the set [0-9] greedily at least 1 and at most 2 times.<br/>
+ /// ○ Match a character in the set [0-9] atomically at least 1 and at most 2 times.<br/>
/// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
int capture_starting_pos = 0;
int capture_starting_pos1 = 0;
int charloop_capture_pos = 0;
- int charloop_capture_pos1 = 0;
int charloop_starting_pos = 0, charloop_ending_pos = 0;
int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
- int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
int loop_iteration = 0;
int stackpos = 0;
int startingStackpos = 0;
charloop_capture_pos = base.Crawlpos();
//}
- // Match a character in the set [0-9] greedily at least 1 and at most 2 times.
- //{
- charloop_starting_pos2 = pos;
-
+ // Match a character in the set [0-9] atomically at least 1 and at most 2 times.
+ {
int iteration1 = 0;
while (iteration1 < 2 && (uint)iteration1 < (uint)slice.Length && char.IsAsciiDigit(slice[iteration1]))
{
slice = slice.Slice(iteration1);
pos += iteration1;
-
- charloop_ending_pos2 = pos;
- charloop_starting_pos2++;
- goto CharLoopEnd2;
-
- CharLoopBacktrack2:
- UncaptureUntil(charloop_capture_pos1);
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos2 >= charloop_ending_pos2)
- {
- goto CharLoopBacktrack1;
- }
- pos = --charloop_ending_pos2;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd2:
- charloop_capture_pos1 = base.Crawlpos();
- //}
+ }
alternation_branch = 1;
goto AlternationMatch1;
case 0:
goto AlternationBranch1;
case 1:
- goto CharLoopBacktrack2;
+ goto CharLoopBacktrack1;
}
AlternationMatch1:;For more diff examples, see https://gist.github.com/MihuBot/f53211ef2653356a070d40dbd2e82145 For a list of JIT diff improvements, see Improvements.md Sample source code for further analysisconst string JsonPath = "RegexResults-1254.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2Lh8F_A");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
|
/ba-g android/wasm infra |
Today we will make a loop like
\w+\bor\d+\batomic, because the only thing the\bcan match after that point is a non-word character, and that means it can't give back a word character or digit to satisfy the loop. But we can extend that further, since we can use the same logic to make such a loop atomic as long as the only things it can match are any subset of word characters. So, for example[a-f0-9]+\b.