The text blacklists and watchlists are slower than they need to be, because under the current implementation, every entry is added to one huge regular expression. However, many of the entries are simple strings, or start with simple strings. This opens up the possibility of using map lookups (`set`, `dict`, etc.) on these strings to speed up matching. Map lookups are faster than regular expressions in Python, and have better big-O scaling than regex alternation.
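
To make the contrast concrete, here is a minimal sketch of the two approaches; the entry strings and function names are invented for illustration, not taken from the codebase:

```python
import re

# Plain-string entries can live in a set, so each check is one hash
# lookup, with the same cost no matter how many entries there are.
literal_entries = {"viagra", "bestessays", "skypesupport"}

def token_is_listed(token):
    return token in literal_entries

# The equivalent alternation is tried branch by branch by the regex
# engine, so matching cost grows with the number of entries.
alternation = re.compile("|".join(map(re.escape, sorted(literal_entries))))

def token_is_listed_regex(token):
    return alternation.fullmatch(token) is not None
```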
Map lookups are already used for the phone number lists. The easiest places to add this sort of optimization appear to be where we have bookending:
- the keyword watchlist, because it's full of single words, domains, and email addresses: after stripping comments, 77% of it appears to match `[a-z0-9]++(?:[\./@-][a-z0-9]++)*+` (see the sketch after this list)
- the username blacklist, because many of its entries are manually anchored to the start and end of the string
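
As a rough sketch of how that partition might work (the helper names here are hypothetical, and the third-party `regex` module is needed for the possessive quantifiers):

```python
import regex

# Entries that fully match the bookend-friendly pattern can be checked by
# set membership; only the remainder still needs the big alternation.
# (Caveat: an unescaped "." in an entry is regex "any character", so
# treating it as a literal dot narrows the match slightly.)
SIMPLE_ENTRY = regex.compile(r"^[a-z0-9]++(?:[\./@-][a-z0-9]++)*+$")

def partition_watchlist(entries):
    simple, complicated = set(), []
    for entry in entries:
        if SIMPLE_ENTRY.match(entry):
            simple.add(entry)
        else:
            complicated.append(entry)
    return simple, complicated

def watched_tokens(post_text, simple):
    # One pass extracts every candidate token; each membership test is O(1).
    candidates = regex.findall(r"[a-z0-9]++(?:[\./@-][a-z0-9]++)*+",
                               post_text.lower())
    return {token for token in candidates if token in simple}
```

The complicated remainder would still be combined into one regular expression, just a much smaller one.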
For non-bookended matches like the website blacklist, I wonder if there would be any speedup from splitting the regular expressions into groups and "switching" on the first few initial characters. All things equal, it's of course fastest to stay in a single call to `regex.search()`, but all things aren't equal: the methods scale differently.
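
A deliberately naive sketch of that switching idea, using just the first character (all names here are invented, and whether the Python-level loop can ever beat a single `search()` is exactly the open question):

```python
import regex
from collections import defaultdict

def build_switched_matcher(patterns):
    buckets, catch_all = defaultdict(list), []
    for pattern in patterns:
        # Bucket a pattern only when it starts with a plain literal
        # character and has no top-level alternation to worry about.
        if pattern and pattern[0].isalnum() and "|" not in pattern:
            buckets[pattern[0]].append(pattern)
        else:
            catch_all.append(pattern)
    compiled = {ch: regex.compile("|".join(ps)) for ch, ps in buckets.items()}
    fallback = regex.compile("|".join(catch_all)) if catch_all else None
    return compiled, fallback

def search_switched(text, compiled, fallback):
    # Walk the text once; at each position, try only the (much smaller)
    # group whose patterns can begin with that character.
    for i, ch in enumerate(text):
        group = compiled.get(ch)
        if group:
            match = group.match(text, i)  # anchored at position i
            if match:
                return match
    return fallback.search(text) if fallback else None
```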
To accomplish something like this, we would need a little refactoring of how blacklist rules are created and updated in the code. It wouldn't need to change anything user-facing.