Fix lexer being extremely slow on large amounts of whitespace. Fixes #857 #858
Conversation
Note the build failed on something unrelated to my changes (Python 3.3.6 environment creation, and it passes locally for me); could someone with repo access try to restart it or fix the underlying issue?
Force-pushed from 50aab7a to 7d00a40
Sorry for letting this sit so long! I've marked this for 2.11. Unfortunately, there have been some other changes to the lexer and […]
Oh, hey there. I'm afraid I only barely remember the changes I made almost a year and a half ago, but I can help if needed. I understand you already rebased this (without pushing, I guess), so I won't dive into it unless you say you'd like me to. BTW, I would be quite happy if this got merged (or otherwise fixed); our project has been using a fork with my fix ever since I made it, and being able to update Jinja would be nice.
No problem, I think I have a handle on it, I'll try to push and merge later today.
For some reason I couldn't push my rebase to the source repo for this PR, but GitHub happily let me merge and commit to it using the online editor. 🤷‍♂️ I reorganized the code a bit and added a ton of comments in case I ever need to revisit this.
Introduced in pallets#858. Tests will follow, also results of performance testing.
Fix for #857, where parsing whitespace in templates usually has quadratic time complexity.
It addresses two pathological regular expressions used in the Jinja lexer, "root" and TOKEN_RAW_BEGIN (the latter is not illustrated in the examples but suffers from the same issue). In the extremely simplified version of the root lexer regex shown below, notice that whitespace may be consumed either by `(.*?)` or by `\s*`, with the latter having priority because it is greedy, but only when the block of whitespace is followed by a variable (or block/comment) opening tag with the "-" modifier (which forces left-stripping of whitespace). This is a pathological case for the built-in Python `re` module and has quadratic complexity due to the backtracking done after consuming every single whitespace character (don't quote me on that, though).

The Jinja lexer does this to implement the whitespace-stripping feature on the left side of tags, but this approach is extremely inefficient with the built-in Python `re` module (note the `regex` library doesn't suffer from this, but it is slower in general).
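As a minimal stand-in for that pattern — an illustrative reduction, not Jinja's actual root regex — the following timing loop exposes the backtracking:

```python
import re
import timeit

# Illustrative reduction of the lexer's "root" pattern: leading template text
# is matched lazily by (.*?) while a greedy \s* tries to claim the whitespace
# in front of a whitespace-stripping "{{-" opener.  When the whitespace is
# followed by a plain "{{" instead, the \s* branch fails and has to backtrack
# through the run of spaces at every step of the lazy (.*?), which is where
# the quadratic behaviour comes from.
pattern = re.compile(r"(.*?)(?:(\s*)(\{\{-)|(\{\{))", re.S)

for n in (2_000, 4_000, 8_000):
    text = " " * n + "{{ x }}"
    elapsed = timeit.timeit(lambda: pattern.match(text), number=1)
    print(f"{n:>5} spaces: {elapsed:.3f}s")  # expect roughly 4x growth per doubling
```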
My approach uses `.rstrip` or `.rfind` + another regular expression to remove the whitespace inside the tokenizer instead; a rough sketch of the idea follows.
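Purely as an illustration of that idea (the names and structure here are hypothetical simplifications, not the code in this PR), stripping after the match instead of inside the regex looks roughly like this:

```python
import re

# Hypothetical reduction, not the merged lexer code: match the opener without
# letting the regex eat the preceding whitespace, then strip it afterwards.
root = re.compile(r"(.*?)(\{\{-?)", re.S)

def split_text(source):
    m = root.match(source)
    text, opener = m.group(1), m.group(2)
    if opener.endswith("-"):
        # Whitespace control requested: drop the trailing whitespace of the
        # preceding text with rstrip() instead of a backtracking \s* group.
        text = text.rstrip()
    return text, opener

print(split_text("   \n   {{- x }}"))  # ('', '{{-')
print(split_text("   \n   {{ x }}"))   # ('   \n   ', '{{')
```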
Potential improvements

Regarding `tokens` in method `tokeniter` (shouldn't it be `tokenizer`, btw?): when I needed to add a flag for a rule to attempt l-stripping, I saw no better way to continue in that pattern if I wanted to avoid a larger refactoring. Any ideas?

Testing and performance measurement
I tested the fix on almost 100k different Jinja2 templates, created by users with varying technical skills and used for various purposes (from templates without any Jinja tags to huge personalized email templates). Parsing all templates produced exactly the same parse tree both before and after my fix. The Jinja2 test suite of course passes as well.
The table below shows the parsing times of the templates I tested the fix with; each parsing attempt (before or after the fix) was added to a bucket based on the rounded log2 of its parsing time. You can see the 4 extremely slow cases being fixed and a general improvement in parsing time.
I didn't add any test case. I could add a test that parses something like `' ' * 40000` and asserts that the time it took didn't cross a reasonably large boundary, but I generally dislike such tests as they may be fragile to CI environment issues. Let me know what you think.
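For reference, the kind of time-bounded test described above could look roughly like the following; the input size and the time ceiling are arbitrary placeholder values, not something taken from this PR:

```python
import time
from jinja2 import Environment

def test_lexing_large_whitespace_is_fast():
    # Hypothetical test along the lines described above; the 40000-space input
    # and the 5-second ceiling are arbitrary choices, not values from the PR.
    env = Environment()
    source = " " * 40000 + "{{ x }}"
    start = time.perf_counter()
    list(env.lex(source))
    elapsed = time.perf_counter() - start
    assert elapsed < 5, f"lexing took {elapsed:.1f}s, quadratic behaviour suspected"
```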