Improve blacklist performance #14786

@talflon

Description

The text blacklists and watchlists are slower than they need to be, because under the current implementation every entry is joined into one huge regular expression. However, many of the entries are simple strings, or start with simple strings, which opens up the possibility of using map lookups (set, dict, etc.) on those strings to speed up matching. Map lookups are faster than regular expressions in Python, and they scale better: a hash-based set lookup is average O(1) in the number of entries, while a regex | alternation degrades as the number of alternatives grows.
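
As a rough sketch of the idea (the entries here are invented, and the token pattern is a non-possessive stdlib version of the one quoted below):

```python
import re

# Invented entries; the real lists are much larger.
entries = ["example.com", "badword", "someone@spam.net"]

# Current approach (simplified): one giant alternation that the regex
# engine must try, alternative by alternative, at each position.
alternation = re.compile("|".join(re.escape(e) for e in entries), re.I)

# Proposed approach for plain-string entries: extract word-like tokens
# once, then test each against a set -- average O(1) per token, no
# matter how many entries there are.
entry_set = frozenset(entries)
TOKEN = re.compile(r"[a-z0-9]+(?:[./@-][a-z0-9]+)*", re.I)

def matches(post: str) -> bool:
    return any(tok.lower() in entry_set for tok in TOKEN.findall(post))
```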

This is already used for the phone number lists. The easiest places to add this sort of optimization appear to be where we have bookending:

  • the keyword watchlist, because it's full of single words, domains, and email addresses: 77% of it appears to match [a-z0-9]++(?:[\./@-][a-z0-9]++)*+ after stripping comments
  • the username blacklist, because many of its entries are manually anchored to the start and end of the string (a sketch of handling these follows the list)
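
For the username case, one possible shape: peel off the `^` and `$` anchors, divert rules whose bodies are effectively literal strings into a set, and keep the rest in the combined regex. The rule strings below are invented and the literal-detection test is deliberately crude:

```python
import re

# Invented example rules; the first two are fully anchored and
# effectively literal strings.
rules = [r"^spam\ king$", r"^bad\-actor$", r"\d{4}bot"]

exact = set()    # fully anchored literal entries -> set lookup
leftover = []    # everything else stays in a combined regex

for rule in rules:
    body = rule[1:-1] if rule.startswith("^") and rule.endswith("$") else None
    if body is not None:
        literal = re.sub(r"\\(.)", r"\1", body)   # drop escaping backslashes
        if re.escape(literal) == body:            # no real metacharacters
            exact.add(literal.lower())
            continue
    leftover.append(rule)

combined = re.compile("|".join(leftover), re.I) if leftover else None

def username_blacklisted(name: str) -> bool:
    return (name.lower() in exact
            or (combined is not None and combined.search(name) is not None))
```

Since `^...$` rules demand a full-string match anyway, testing the whole lowercased name for set membership preserves their semantics exactly.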

For non-bookended matches like the website blacklist, I wonder if there would be any speedup from splitting the regular expressions into groups and "switching" on the first few characters, along the lines of the sketch below. All things equal, it's of course fastest to stay in a single call to regex.search(), but all things aren't equal: the methods scale differently.
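
A minimal sketch of that dispatch idea, assuming every pattern starts with a literal character (the patterns themselves are invented):

```python
import re
from collections import defaultdict

# Invented patterns; assume each one starts with a literal character.
patterns = [r"cheap-\w+\.com", r"cheap\w*pills", r"free-\w+-download"]

# Bucket the patterns by first character and compile one smaller
# alternation per bucket instead of a single giant one.
buckets = defaultdict(list)
for p in patterns:
    buckets[p[0]].append(p)
compiled = {ch: re.compile("|".join(group)) for ch, group in buckets.items()}

# A cheap regex finds candidate positions (any bucket's first character);
# only the matching bucket's alternation is tried at each candidate.
dispatch = re.compile("[" + re.escape("".join(buckets)) + "]")

def search(text: str):
    pos = 0
    while (m := dispatch.search(text, pos)) is not None:
        hit = compiled[m.group()].match(text, m.start())
        if hit:
            return hit
        pos = m.start() + 1
    return None
```

Whether this beats a single regex.search() in practice would need measuring; the per-candidate Python overhead could easily eat the savings on short posts.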

To accomplish something like this, we would need a little refactoring of how blacklist rules are created and updated in the code. It wouldn't need to change anything user-facing.
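
One hypothetical shape for that refactoring (the class name and the partitioning test are placeholders): a container that stores the raw rule strings and rebuilds its derived structures whenever a rule is added or removed, so the rule format users write stays exactly the same.

```python
import re

class HybridRuleList:
    """Stores raw rule strings; rebuilds the derived set + regex on
    every update. Hypothetical -- the partitioning test is a stub."""

    def __init__(self, rules=()):
        self._rules = list(rules)
        self._rebuild()

    def _rebuild(self):
        # Stub partition: purely alphanumeric rules go to the set,
        # everything else stays in one combined regex.
        self._exact = {r.lower() for r in self._rules if r.isalnum()}
        rest = [r for r in self._rules if not r.isalnum()]
        self._regex = re.compile("|".join(rest), re.I) if rest else None

    def add(self, rule):
        self._rules.append(rule)
        self._rebuild()

    def remove(self, rule):
        self._rules.remove(rule)
        self._rebuild()

    def matches(self, text):
        tokens = re.findall(r"[a-z0-9]+", text, flags=re.I)
        return (any(t.lower() in self._exact for t in tokens)
                or (self._regex is not None
                    and self._regex.search(text) is not None))
```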
