The text blacklists and watchlists are slower than they need to be, because under the current implementation, every entry is added to one huge regular expression. However, many of the entries are simple strings, or start with simple strings. This opens up the possibility of using map lookups (`set`, `dict`, etc.) on these strings to speed up matching. Map lookups are faster than regular expressions in Python, and have better big-O scaling than regex alternation.
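
To make the contrast concrete, here is a minimal sketch of the two approaches; the entry strings and function names are invented for illustration, not taken from the codebase:

```python
import re

# Plain-string entries can live in a set, so each check is one hash
# lookup, with the same cost no matter how many entries there are.
literal_entries = {"viagra", "bestessays", "skypesupport"}

def token_is_listed(token):
    return token in literal_entries

# The equivalent alternation is tried branch by branch by the regex
# engine, so matching cost grows with the number of entries.
alternation = re.compile("|".join(map(re.escape, sorted(literal_entries))))

def token_is_listed_regex(token):
    return alternation.fullmatch(token) is not None
```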
Map lookups are already used for the phone number lists. The easiest places to add this sort of optimization appear to be where we have bookending:
- the keyword watchlist, because it's full of single words, domains, and email addresses: after stripping comments, 77% of it appears to match `[a-z0-9]++(?:[\./@-][a-z0-9]++)*+` (see the sketch after this list)
- the username blacklist, because many of its entries are manually anchored to the start and end of the string
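
As a rough sketch of how that partition might work (the helper names here are hypothetical, and the third-party `regex` module is needed for the possessive quantifiers):

```python
import regex

# Entries that fully match the bookend-friendly pattern can be checked by
# set membership; only the remainder still needs the big alternation.
# (Caveat: an unescaped "." in an entry is regex "any character", so
# treating it as a literal dot narrows the match slightly.)
SIMPLE_ENTRY = regex.compile(r"^[a-z0-9]++(?:[\./@-][a-z0-9]++)*+$")

def partition_watchlist(entries):
    simple, complicated = set(), []
    for entry in entries:
        if SIMPLE_ENTRY.match(entry):
            simple.add(entry)
        else:
            complicated.append(entry)
    return simple, complicated

def watched_tokens(post_text, simple):
    # One pass extracts every candidate token; each membership test is O(1).
    candidates = regex.findall(r"[a-z0-9]++(?:[\./@-][a-z0-9]++)*+",
                               post_text.lower())
    return {token for token in candidates if token in simple}
```

The complicated remainder would still be combined into one regular expression, just a much smaller one.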
For non-bookended matches like the website blacklist, I wonder if there would be any speedup from splitting the regular expressions into groups and "switching" on the first few initial characters. All things equal, it's of course fastest to stay in a single call to `regex.search()`, but all things aren't equal: the methods scale differently.
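
A deliberately naive sketch of that switching idea, using just the first character (all names here are invented, and whether the Python-level loop can ever beat a single `search()` is exactly the open question):

```python
import regex
from collections import defaultdict

def build_switched_matcher(patterns):
    buckets, catch_all = defaultdict(list), []
    for pattern in patterns:
        # Bucket a pattern only when it starts with a plain literal
        # character and has no top-level alternation to worry about.
        if pattern and pattern[0].isalnum() and "|" not in pattern:
            buckets[pattern[0]].append(pattern)
        else:
            catch_all.append(pattern)
    compiled = {ch: regex.compile("|".join(ps)) for ch, ps in buckets.items()}
    fallback = regex.compile("|".join(catch_all)) if catch_all else None
    return compiled, fallback

def search_switched(text, compiled, fallback):
    # Walk the text once; at each position, try only the (much smaller)
    # group whose patterns can begin with that character.
    for i, ch in enumerate(text):
        group = compiled.get(ch)
        if group:
            match = group.match(text, i)  # anchored at position i
            if match:
                return match
    return fallback.search(text) if fallback else None
```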
To accomplish something like this, we would need a little refactoring of how blacklist rules are created and updated in the code. It wouldn't need to change anything user-facing.