Catch more posts whose text is mostly repeated #14788

@talflon

Is your feature request related to a problem? Please describe.

There are a lot of spam and troll posts that consist mostly of a single piece of text copied and pasted multiple times. Some of these are currently caught by some combination of "Repeating words", "Repeating characters", or "Few unique characters", but many aren't.

Here's a Metasmoke search showing some of those posts (requires logging into Metasmoke). Like the posts that match the existing "Repeating" and "Few unique" rules mentioned above, these posts are lazy attempts to fill the text box rather than to post anything meaningful, so the vast majority are TPs, and the few FPs and NAAs are not very valuable. It also looks like, on a daily basis, a spammer posts their spam payload as repeated text in the post body.

Right now we catch a few of these through watchlist or blacklist entries, or when a human happens to spot them. But as long as we have a computer working for us, the computer ought to be able to check for text repetition better than we can. That might also surface some new spam or troll posts more quickly and automatically, before we add blacklist entries specific to them.

Describe the solution you'd like

I'd like SmokeDetector to check directly for this sort of post, and include enough information in its "why" text so that we humans can quickly see what it found. We could add a check for this to the "Repeating words" rule, so that findspam.has_repeating_words() would become:

  1. Run the existing has_repeating_words() logic, and if it finds something, just return that
  2. If that didn't find anything, check whether at least X% of the text consists of the same text repeated, where the repeated text is not just a single repeated character
  3. If so, return something like

    Repeated words in answer: xxx characters repeated yyy times ('start of text' ... 'end of text')
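
To make steps 2 and 3 concrete, here is a minimal sketch of how the check might look. It is not SmokeDetector code: the names find_repeated_text and describe, the 0.7 threshold standing in for "X%", and the minimum unit length are all assumptions for illustration. It uses a backreference regex from the standard re module, which can be slow on long posts, and it only considers the longest non-overlapping run that finditer happens to return, so a real rule would probably need a faster and more careful search.

```python
import re


def find_repeated_text(text, min_fraction=0.7, min_unit_len=2):
    """Return (unit, count) if one back-to-back repeated run dominates text, else None.

    min_fraction stands in for the "X%" threshold and is an assumed value.
    """
    # The lazy group picks the shortest unit at each start position;
    # \1+ then requires at least one more immediate copy of that unit.
    pattern = re.compile(r"(.{%d,}?)\1+" % min_unit_len, re.DOTALL)
    best = None
    for match in pattern.finditer(text):
        if best is None or len(match.group(0)) > len(best.group(0)):
            best = match
    if best is None:
        return None
    unit, run = best.group(1), best.group(0)
    # A run of one character is left to the "Repeating characters" rule.
    if len(set(unit)) == 1:
        return None
    if len(run) < min_fraction * len(text):
        return None
    return unit, len(run) // len(unit)


def describe(unit, count, edge=12):
    """Build a "why" fragment in the format suggested above (hypothetical helper)."""
    return "%d characters repeated %d times (%r ... %r)" % (
        len(unit), count, unit[:edge], unit[-edge:])


post = "Hello!\n" + "buy cheap pills now! " * 5 + "Bye"
found = find_repeated_text(post)
if found:
    print(describe(*found))
```

Because the run only has to cover a fraction of the post, a short greeting before the repeated payload or a sign-off after it does not prevent a match.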

Describe alternatives you've considered

This could also be done as a new rule, such as "Repeating text". I think either way we should attempt to avoid duplicate reports for posts already caught by similar rules.

It might not be worth checking titles or usernames for this.

Checking that the entire post is repeated, with no additional text, seems too restrictive, even if it might be faster to check. Oftentimes spammers post a small amount of text before or after the repeated text.

We could also look for repeated text with a little non-repeating text in between repetitions, like

one spam spamm spammm two spam spamm spammm three spam spam spammm four

I'm afraid of doing that for the following reasons:

  1. I imagine it could cause FPs where the author is purposefully repeating themselves, or discussing something repetitive.
  2. I imagine it might be slower to search for.
  3. I imagine it won't find many more TPs, because most people posting this sort of thing will simply Ctrl+C, Ctrl+V repeatedly, which produces the version without non-repeating text in between.

But I haven't investigated enough to be sure if my fears are correct.
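
One way to investigate would be to measure how often this gapped kind of repetition shows up in known TP and FP posts. The sketch below is only an assumption about how such a measurement might be done, not a proposed rule: most_repeated_shingle is a hypothetical helper that counts every run of k consecutive words and reports the most frequent one, along with a rough estimate of how much of the post it covers. Overlapping occurrences are counted separately, so the coverage figure is capped at 1.0 and should be read as an upper bound.

```python
from collections import Counter


def most_repeated_shingle(text, k=3):
    """Return (phrase, count, coverage) for the most common k-word run, or None.

    k and the coverage estimate are illustrative choices, not tuned values.
    """
    words = text.split()
    shingles = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
    if not shingles:
        return None
    shingle, count = Counter(shingles).most_common(1)[0]
    # Rough share of the words that belong to the repeated phrase.
    coverage = min(1.0, count * k / len(words))
    return " ".join(shingle), count, coverage


sample = ("one spam spamm spammm two spam spamm spammm "
          "three spam spamm spammm four")
print(most_repeated_shingle(sample))
```

Running this over a sample of reported posts would give some idea of whether the extra TPs justify the FP risk and the extra cost.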

    Labels

    area: spamchecks (Detections or the process of testing posts. No space in the label because of Hacktoberfest.)
    type: feature request (Shinies.)
