How should we go about blacklisting Unicode-obfuscated keywords? #14355

@talflon

Description

Current method

Currently we're blacklisting a number of Unicode-obfuscated keywords which have shown up multiple times, like

  • ℚ𝕒𝕥𝕒𝕣
  • 𝑬𝒙𝒑𝒆𝒅𝒊𝒂
  • ᏟႮՏͲϴᎷᎬᎡ
  • 𝔸𝕚rlines

I like this, because I think it's very unlikely someone would naturally use Unicode formatting on these words. But by only blacklisting the ones we notice, we're missing the opportunity to catch new variants, which we can already expect to see later: if we've seen "𝓕𝓪𝓬𝓮𝓫𝓸𝓸𝓴", it'll be no great surprise to later see "𝐅𝐚𝐜𝐞𝐛𝐨𝐨𝐤". And I've seen that some of the spammers, instead of simply formatting the whole word, will format individual letters. That's a large number of possible variants, so it'd be easier to write regexes or code to catch them automatically than to wait and see which letter they'll pick next.

Also, it looks to me like we might not be catching things l̤̈ï̤k̤̈ë̤ ẗ̤ḧ̤ï̤s̤̈.

I have a couple different ideas for how to tackle this and was hoping for feedback on which might be the best way forward. I'm happy to try my hand at any of these.

Idea A: expand the "Obfuscated word" reasons to include a lot more Unicode

This was originally my Plan B, but looking at the SmokeDetector code, it looks like it would be fairly simple to do, so I'm calling it my Plan A now. obfuscated_word() could be rewritten to catch a lot of these Unicode tricks, instead of just 1337 speak and p.u.n.c.t.u.a.t.i.o.n.

A number of the above-mentioned blacklisted keywords could be removed, if desired, once obfuscated_word() is patched to catch them.
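To illustrate how small the core Unicode check could be (this is a sketch of the idea, not SmokeDetector's actual code, and it says nothing about how obfuscated_word() is structured today): compatibility-normalise the word, drop combining marks, and see whether anything changed.

```python
import unicodedata

def looks_unicode_obfuscated(word):
    """True if the word uses styled Unicode letters (ℚ, 𝔸, 𝓕, ...)
    or attached diacritics instead of plain characters, i.e. if
    NFKD normalisation plus mark-stripping changes it."""
    folded = ''.join(c for c in unicodedata.normalize('NFKD', word)
                     if not unicodedata.combining(c))
    return folded != word
```

For example, this flags "𝔸𝕚rlines" and "ćheap" but not "Airlines"; the false-positive question for accented words is discussed under Idea B below.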

A small downside is that obfuscation_keywords needs a manual PR to update, whereas the blacklisters are already familiar and comfortable with updating the blacklist.

Idea A2: also check for blacklist and watchlist items

For many of these Unicode characters, it's unambiguous which letter they stand for, even more so than in 1337 speak. It might be possible to do something like:

  1. "deobfuscate" titles and bodies by mapping unambiguous Unicode characters to their plain equivalents and stripping diacritics
  2. run the blacklist and watchlist items over the deobfuscated versions
  3. anywhere there's a match, check that region of the original to see if it was changed by the deobfuscation
  4. if so, count it as an "obfuscated word"

That would effectively add the blacklist and watchlist to obfuscation_keywords, and ensure that long multi-word blacklist keywords can't be dodged by adding obfuscation to one of the words.
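Steps 1–4 can be sketched as follows (helper names are mine, not SmokeDetector's, and this uses NFKD folding as a stand-in for whatever deobfuscation table we settle on). The trick is to keep a map from each deobfuscated character back to its original index, so a match in the clean text can be compared against the region of the original it came from:

```python
import re
import unicodedata

def deobfuscate_with_map(text):
    """Step 1: fold each character to its plain form (NFKD plus
    dropping combining marks), recording which original index
    produced each output character."""
    out_chars = []
    index_map = []  # output index -> original index
    for i, ch in enumerate(text):
        for d in unicodedata.normalize('NFKD', ch):
            if unicodedata.combining(d):
                continue  # strip diacritics / modifier characters
            out_chars.append(d)
            index_map.append(i)
    return ''.join(out_chars), index_map

def find_obfuscated(text, pattern):
    """Steps 2-4: run `pattern` over the deobfuscated text and report
    only matches whose original region was changed by deobfuscation."""
    deob, index_map = deobfuscate_with_map(text)
    hits = []
    for m in re.finditer(pattern, deob, re.IGNORECASE):
        start = index_map[m.start()]
        end = index_map[m.end() - 1] + 1
        original = text[start:end]
        if original != m.group():  # changed by folding -> obfuscated
            hits.append(original)
    return hits
```

A plain-text match (e.g. "qatar" spelled normally) survives folding unchanged and is not reported, so this only fires when the blacklisted word was actually obfuscated.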

Idea B: long regexes

A regex something like

[WŴѠW🅆ẈẄẀẆẂW̊ꓪԜ𝕨𝑤𝖜𝐖𝖶𝑊𝒘𝖂𝙒𝙬𝐰𝘄𝑾𝓌𝗪𝗐𝚠𝔀𝔴𝚆𝘞𝓦𝔚𝘸𝕎𝒲ⓌƜᵂᴡ𑜎𑜏𑜊ԱᏔᎳᎳʷ]
\p{M}*+
[هھەہⵔంOₒŌŎŐꬽ𐊒𑢷𑢨𑢵סОዐO𐊫🄾ṒṐỠṌỌỐỚỢỒỘṎỔỖỎỜỞꓳଠ𐓂ℴ𐐄𝜎𝖮𝝾𝒪𝝈𝘰𝕺𝛐𝞂𝝄𝑂𝙾𝕆𝗢𝜪𝐨𝑜𝚘𝚶𝗼𝛔𝞞𝑶𝔬𝝤𝔒𝜊𝐎𝓞𝘖𝙤𝖔𝙊𝒐𝛰𝕠𝞼𝗈𝓸𝞸ΣΟºÔÕÒÓÖⓄⲞಂᲿﻫﻪﻩﻬᴼᴑᴏᵒ𞹤𞸤𞺄ՕංഠംǬȮȬȰȎǪȌȪƠǑ𐔖ﮨﮩﮧﮫﮬﮭﮦﮪဝ〇]
\p{M}*+
[RŘŖŔꭈꭇГR🅁ṘṚṜṞꓣ𐒴ℜℝℛᖇ𝔯𝖱𝗥𝙍𝑟𝕽𝗿𝘳𝑅𝒓𝗋𝚁𝓡𝑹𝘙𝐫𝕣𝖗𝐑𝓻𝓇𝙧𝚛ⓇⲄᴦᵣᴿᏒᎡᎱȐƦȒʳ𖼵]
\p{M}*+
[DĎⅮD🄳ḒḎḊḐḌꓓꓒԀⅆⅅᗞᗪᑯ𝗗𝔡𝖽𝗱𝓓𝖣𝘋𝒅𝒹𝔇𝓭𝔻𝘥𝐝𝕕𝙙𝐷𝖉𝚍𝕯𝘿𝑑𝐃𝑫𝙳𝒟ⒹᵈᴰᎠᏧ]
\p{M}*+
(?<!word)

(without the newlines) would match all sorts of obfuscations and Unicode formattings of "word". (\p{M}*+ is to catch the use of separate modifier characters, l̤̈ï̤k̤̈ë̤ ẗ̤ḧ̤ï̤s̤̈. And yes, there are characters that could be added or removed to the character sets above; I'm still working on lists.)

I imagine it would not be a costly regex to run, and it doesn't require any reviewed changes to the Python code.
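Character classes like the ones above don't have to be curated by hand. A sketch of a generator (my own code, and deliberately incomplete: it only finds variants that have a Unicode decomposition, so cross-script visual confusables like Cyrillic О for O have to come from UTS #39 confusables data instead):

```python
import sys
import unicodedata

def fold(ch):
    """Compatibility-decompose a character and drop combining marks."""
    return ''.join(c for c in unicodedata.normalize('NFKD', ch)
                   if not unicodedata.combining(c))

def char_class(letter):
    """Build a regex character class of every codepoint that folds to
    `letter`, case-insensitively."""
    chars = {letter.upper(), letter.lower()}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip the surrogate range
        ch = chr(cp)
        if fold(ch).casefold() == letter.casefold():
            chars.add(ch)
    return '[' + ''.join(sorted(chars)) + ']'
```

An obfuscation-tolerant pattern for "word" would then be something like `r'\p{M}*+'.join(char_class(c) for c in 'word')`, run under the regex module (stdlib re doesn't support \p{M}).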

But it would be a little annoying for humans to parse and copy/paste. It's possible that some of these regexes could be made a little shorter and/or easier to read by:

  • using the (?(DEFINE)) syntax that regex makes available to us
  • selectively choosing to match only some of the types of obfuscation on a per-blacklist-item basis

but they'd still be big.

Another downside to the regex format I'm considering above is that it's "1 strike and you're out". That is appropriate for some uses of Unicode (e.g. "ℜefund") but maybe not for others that could be accidentally typed by an innocent person on a keyboard with dead-key accents (e.g. "ćheap"). It's hard to embed logic other than "1 strike and you're out" into a regex; it would be easier to do in obfuscated_word().
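That distinction is easy to express in code, because Unicode already separates canonical decompositions (accents, which a dead key can produce) from compatibility decompositions (styled letters like ℜ, which take deliberate effort). A hypothetical per-character classifier:

```python
import unicodedata

def _fold(ch, form):
    """Decompose under `form` ('NFD' or 'NFKD') and drop marks."""
    return ''.join(c for c in unicodedata.normalize(form, ch)
                   if not unicodedata.combining(c))

def classify(ch):
    """'accent'     - differs only by canonical decomposition:
                      a diacritic, possibly an innocent dead-key typo
       'formatting' - differs only under compatibility decomposition:
                      a styled letter, almost certainly deliberate
       'plain'      - unchanged"""
    if _fold(ch, 'NFD') != ch:
        return 'accent'
    if _fold(ch, 'NFKD') != ch:
        return 'formatting'
    return 'plain'
```

obfuscated_word() could then count 'formatting' characters as instant strikes while requiring, say, two or more 'accent' characters before flagging.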

Idea B1: web app & scripts, pasted directly to SmokeDetector

I was thinking of creating scripts and a single-page web app to help create these regexes. Blacklisters who are interested could use them to create these crazy regexes and use them in !!/blacklist-keyword and !!/watch.

Idea B2: offline scripts

Scripts could instead be used to, every once in a while, look for existing blacklist entries for things like "𝔸𝕚rlines" and manually suggest PRs replacing them with the longer regexes.
