Catch more posts whose text is mostly repeated #14788

### Is your feature request related to a problem? Please describe.

Many spam and troll posts consist mostly of a single piece of text copied and pasted multiple times. Some of these are currently caught by a combination of the "Repeating words", "Repeating characters", and "Few unique characters" rules, but many aren't.

Here's [a Metasmoke search showing some of those posts](https://m.erwaysoftware.com/search?body_is_regex=1&body=%5E.%7B0%2C10%7D%28%28%5Cw%29%28%3F%21%5C2%29.%7B15%2C300%7D%5Cw%29%28%3F%3D%5C1%2B%2B.%7B0%2C10%7D%24%29&why_is_regex=1&why_is_inverse_regex=1&why=%28%3F-i%3A%28%3F%3ARepeating%7CFew+unique%29+%28%3F%3Acharacters%7Cwords%29%29) (requires logging into Metasmoke). Like posts that match the existing "Repeating" and "Few unique" rules I mentioned above, these posts are characterized by lazy attempts to fill the textbox instead of actually posting something of meaning, so the vast majority are TPs, and the few FPs and NAAs are not very valuable. It also looks like, on a daily basis, some spammer posts their spam payload as repeated text in the post body.
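For readers who can't log into Metasmoke, here is a small sketch that decodes the query string of the search link above, so the two regexes it uses are readable. The helper name `decode_search` is made up for this example; the URL is copied from the link.

```python
from urllib.parse import parse_qs, urlsplit

# The search URL from the link above, split only for readability.
SEARCH_URL = (
    "https://m.erwaysoftware.com/search?body_is_regex=1"
    "&body=%5E.%7B0%2C10%7D%28%28%5Cw%29%28%3F%21%5C2%29.%7B15%2C300%7D%5Cw%29"
    "%28%3F%3D%5C1%2B%2B.%7B0%2C10%7D%24%29"
    "&why_is_regex=1&why_is_inverse_regex=1"
    "&why=%28%3F-i%3A%28%3F%3ARepeating%7CFew+unique%29+%28%3F%3Acharacters%7Cwords%29%29"
)


def decode_search(url):
    """Return the query parameters of a search URL as a plain dict (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    # Every parameter appears once here, so keep just the first value.
    return {key: values[0] for key, values in query.items()}


params = decode_search(SEARCH_URL)
print("body:", params["body"])
print("why (inverted):", params["why"])
```

As I read the decoded `body` pattern, it allows up to 10 leading characters, then a chunk of 17 to 302 characters that starts and ends with a word character, and then requires that the rest of the body is one or more further copies of that chunk plus at most 10 trailing characters. The `why` pattern is used inverted, so posts whose reasons already mention the "Repeating" or "Few unique" rules are left out.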

Right now we catch a few of these through watchlist or blacklist entries, or when humans spot them by eye. But as long as we have a computer working for us, the computer ought to be able to check for text repetition better than we can. That might also surface some new spam or troll posts more quickly and automatically, before we add blacklist entries specific to them.

### Describe the solution you'd like

I'd like SmokeDetector to check directly for this sort of post, and include enough information in its "why" text so we humans can quickly see what it found. We could add a check for this to the "Repeating words" rule, so that `findspam.has_repeating_words()` would become:

1. Run the existing `has_repeating_words()` logic, and if it finds something, just return that
2. If that didn't find anything, check whether >= X% of the text consists of the same text repeated, where the repeated text is not just a single repeated character
3. If so, return something like
   > Repeated words in answer: xxx characters repeated yyy times ('start of text' ... 'end of text')
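
To make steps 2 and 3 more concrete, here is a rough, unoptimized sketch of one way the check could work. It is not existing SmokeDetector code: `find_repeated_text`, `describe`, `RepeatedText`, and the parameters `min_coverage` (the "X%" above), `min_unit`, and `max_unit` are all names and defaults I made up for illustration, and how this would hook into `has_repeating_words()` is left open.

```python
from typing import NamedTuple, Optional


class RepeatedText(NamedTuple):
    unit: str        # the text that is repeated
    count: int       # number of complete consecutive copies
    coverage: float  # fraction of the whole text covered by the repeated run


def find_repeated_text(text: str, min_coverage: float = 0.7,
                       min_unit: int = 2, max_unit: int = 300) -> Optional[RepeatedText]:
    """Find the longest run of back-to-back copies of one piece of text.

    All thresholds are placeholder assumptions, not tuned values.
    """
    n = len(text)
    best = None  # (covered_chars, period, start)
    for period in range(min_unit, min(max_unit, n // 2) + 1):
        run = 0  # consecutive positions i where text[i] == text[i + period]
        for i in range(n - period):
            run = run + 1 if text[i] == text[i + period] else 0
            if run < period:
                continue  # fewer than two full copies so far
            covered = run + period
            if best is not None and covered <= best[0]:
                continue  # ties keep the shorter unit found earlier
            start = i - run + 1
            # Skip runs of a single repeated character ("aaaa...").
            if len(set(text[start:start + period])) > 1:
                best = (covered, period, start)
    if best is None or best[0] / n < min_coverage:
        return None
    covered, period, start = best
    return RepeatedText(text[start:start + period], covered // period, covered / n)


def describe(found: RepeatedText, edge: int = 20) -> str:
    """Build the 'why' fragment proposed in step 3."""
    return "{} characters repeated {} times ({!r} ... {!r})".format(
        len(found.unit), found.count, found.unit[:edge], found.unit[-edge:])


sample = "Intro." + "buy cheap pills now " * 5 + "!End"
found = find_repeated_text(sample)
if found is not None:
    print(describe(found))
```

Because the run may start and end anywhere, a small amount of text before or after the repetition doesn't prevent a match, as long as the coverage threshold is met. The reported unit can come out as a rotation of the text that was actually pasted, since the run is extended as far left as the characters keep matching.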

### Describe alternatives you've considered

This could also be done as a new rule, such as "Repeating text". I think either way we should attempt to avoid duplicate reports for posts already caught by similar rules.

It might not be worth checking titles or usernames for this.

Checking that the entire post is repeated, with no additional text, seems too restrictive, even if it might be faster to check. Oftentimes spammers post a small amount of text before or after the repeated text.

We could also look for repeated text with a little non-repeating text in between repetitions, like

> **one** spam spamm spammm **two** spam spamm spammm **three** spam spamm spammm **four**

I'm afraid of doing that for the following reasons:

1. I imagine it could cause FPs where the author is purposefully repeating themselves, or discussing something repetitive.
2. I imagine it might be slower to search for.
3. I imagine it won't find many more TPs, because most people posting this sort of thing will simply <kbd>Ctrl</kbd><kbd>C</kbd>, <kbd>Ctrl</kbd><kbd>V</kbd>... and that would create the version without non-repeating text in between.

But I haven't investigated enough to be sure whether my fears are correct.
