Fix lexer being extremely slow on large amounts of whitespace. Fixes #857 #858
Conversation
Note the build failed on something unrelated to my changes (Python 3.3.6 environment creation, and it passes locally for me); could someone with repo access try to restart it or fix the underlying issue?
Force-pushed from 50aab7a to 7d00a40
Sorry for letting this sit so long! I've marked this for 2.11. Unfortunately, there have been some other changes to the lexer and […]
Oh, hey there. I'm afraid I only barely remember the changes I made almost a year and a half ago, but I can help if needed. I understand you already rebased this (without pushing, I guess), so I won't dive into it unless you say you'd like me to. BTW, I would be quite happy if this got merged (or otherwise fixed); our project has been using a fork with my fix ever since I made it, and being able to update Jinja would be nice.
No problem, I think I have a handle on it, I'll try to push and merge later today.
For some reason I couldn't push my rebase to the source repo for this PR, but GitHub happily let me merge and commit to it using the online editor. 🤷‍♂️ I reorganized the code a bit and added a ton of comments in case I ever need to revisit this.
Introduced in pallets#858. Tests will follow, also results of performance testing.
Fix for #857, where parsing whitespace in templates usually has quadratic time complexity.
It addresses two pathological regular expressions used in the Jinja lexer, "root" and TOKEN_RAW_BEGIN (the latter is not illustrated in the examples but suffers from the same issue). In the extremely simplified version of the root lexer regex shown below, notice that whitespace may be consumed either by `(.*?)` or by `\s*`, with the latter having priority because it is greedy, but only when the block of whitespace is followed by a variable (or block/comment) opening tag with the "-" modifier (which forces left-stripping of whitespace). This is a pathological case for the built-in Python `re` module and has quadratic complexity due to the backtracking done after consuming every single whitespace character (don't quote me on that, though).

The Jinja lexer does this to implement the whitespace-stripping feature on the left side of tags, but this approach is extremely inefficient with the built-in Python `re` module (note the `regex` library doesn't suffer from this, but it is slower in general).
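As a minimal stand-in for that pattern — an illustrative reduction, not Jinja's actual root regex — the following timing loop exposes the backtracking:

```python
import re
import timeit

# Illustrative reduction of the lexer's "root" pattern: leading template text
# is matched lazily by (.*?) while a greedy \s* tries to claim the whitespace
# in front of a whitespace-stripping "{{-" opener.  When the whitespace is
# followed by a plain "{{" instead, the \s* branch fails and has to backtrack
# through the run of spaces at every step of the lazy (.*?), which is where
# the quadratic behaviour comes from.
pattern = re.compile(r"(.*?)(?:(\s*)(\{\{-)|(\{\{))", re.S)

for n in (2_000, 4_000, 8_000):
    text = " " * n + "{{ x }}"
    elapsed = timeit.timeit(lambda: pattern.match(text), number=1)
    print(f"{n:>5} spaces: {elapsed:.3f}s")  # expect roughly 4x growth per doubling
```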
My approach uses `.rstrip` or `.rfind` + another regular expression to remove the whitespace inside the tokenizer instead; a rough sketch of the idea follows.
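Purely as an illustration of that idea (the names and structure here are hypothetical simplifications, not the code in this PR), stripping after the match instead of inside the regex looks roughly like this:

```python
import re

# Hypothetical reduction, not the merged lexer code: match the opener without
# letting the regex eat the preceding whitespace, then strip it afterwards.
root = re.compile(r"(.*?)(\{\{-?)", re.S)

def split_text(source):
    m = root.match(source)
    text, opener = m.group(1), m.group(2)
    if opener.endswith("-"):
        # Whitespace control requested: drop the trailing whitespace of the
        # preceding text with rstrip() instead of a backtracking \s* group.
        text = text.rstrip()
    return text, opener

print(split_text("   \n   {{- x }}"))  # ('', '{{-')
print(split_text("   \n   {{ x }}"))   # ('   \n   ', '{{')
```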
Potential improvements

Regarding `tokens` in method `tokeniter` (shouldn't it be `tokenizer`, btw?): when I needed to add a flag for a rule to attempt l-stripping, I saw no better way to continue in that pattern if I wanted to avoid a larger refactoring. Any ideas?

Testing and performance measurement
I tested the fix on almost 100k different Jinja2 templates, created by users with varying technical skills and used for various purposes (from templates without any Jinja tags to huge personalized email templates). Parsing all templates produced exactly the same parse tree both before and after my fix. The Jinja2 test suite of course passes as well.
The table below shows the parsing times of the templates I tested the fix with; each parsing attempt (before or after the fix) was added to a bucket based on the rounded log2 of its parsing time. You can see the 4 extremely slow cases being fixed and a general improvement in parsing time.
I didn't add any test case. I could add a test that parses something like `' ' * 40000` and asserts that the time it took didn't cross a reasonably large boundary, but I generally dislike such tests as they may be fragile to CI environment issues. Let me know what you think.
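For reference, the kind of time-bounded test described above could look roughly like the following; the input size and the time ceiling are arbitrary placeholder values, not something taken from this PR:

```python
import time
from jinja2 import Environment

def test_lexing_large_whitespace_is_fast():
    # Hypothetical test along the lines described above; the 40000-space input
    # and the 5-second ceiling are arbitrary choices, not values from the PR.
    env = Environment()
    source = " " * 40000 + "{{ x }}"
    start = time.perf_counter()
    list(env.lex(source))
    elapsed = time.perf_counter() - start
    assert elapsed < 5, f"lexing took {elapsed:.1f}s, quadratic behaviour suspected"
```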