New duplicate algorithm to check for similar entries #52
Merged
PeterLombaers merged 14 commits into asreview:main from george-gca:new_duplicate_algorithm on Feb 6, 2025
Changes from 8 commits
5ddaf4a Added more dependencies to project (george-gca)
95b6569 Added new duplicate finding algorithm (george-gca)
8b921e7 Added more params to deduplicate similar function (george-gca)
b0a5259 Added missing params for deduplicate similar entries (george-gca)
7a6d916 Updated README (george-gca)
8995c45 Added similar dedup info to Tutorials (george-gca)
1762e15 Added missing example image (george-gca)
413f23d Fixed ruff warnings (george-gca)
a27bad9 Renamed similarity as threshold (george-gca)
871d159 Printing when not using similar (george-gca)
4657d9b Printing PID of duplicate entries (george-gca)
2a833a6 Also supporting PID when searching for similar (george-gca)
4db85ae Changed dedup without --similar to use old solution (george-gca)
61c3c40 Created tests for dedup (george-gca)
@@ -0,0 +1,105 @@
import re
from difflib import SequenceMatcher

import ftfy
import pandas as pd
from asreview import ASReviewData
from rich.console import Console
from rich.text import Text
from tqdm import tqdm


def _print_similar_list(similar_list: list[tuple[int, int]], data: pd.Series):
    print_seq_matcher = SequenceMatcher()
    console = Console()
    print('Found similar titles at lines:')

    for i, j in similar_list:
        print_seq_matcher.set_seq1(data.iloc[i])
        print_seq_matcher.set_seq2(data.iloc[j])
        text = Text()
        text.append(f"\nLines {i+1} and {j+1}:\n", style='bold')

        for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
            if tag == 'replace':
                # add rich strikethrough
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'delete':
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
            if tag == 'insert':
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'equal':
                text.append(f'{data.iloc[i][i1:i2]}', style='dim')

        console.print(text)

    print('')


def drop_duplicates_by_similarity(
        asdata: ASReviewData,
        similarity: float = 0.98,
        skip_abstract: bool = False,
        discard_stopwords: bool = False,
        stopwords_language: str = 'english',
        strict_similarity: bool = False,
        verbose: bool = False):

    if skip_abstract:
        data = asdata.df['title']
    else:
        data = pd.Series(asdata.texts)

    symbols_regex = re.compile(r'[^ \w\d\-_]')
    spaces_regex = re.compile(r'\s+')

    s = (
        data
        .apply(ftfy.fix_text)
        .str.replace(symbols_regex, '', regex=True)
        .str.replace(spaces_regex, ' ', regex=True)
        .str.lower()
        .str.strip()
        .replace("", None)
    )

    if discard_stopwords:
        try:
            from nltk.corpus import stopwords
            stopwords_set = set(stopwords.words(stopwords_language))
        except LookupError:
            import nltk
            nltk.download('stopwords')
            stopwords_set = set(stopwords.words(stopwords_language))

        # Build the alternation outside the f-string: a backslash inside an
        # f-string expression is a syntax error before Python 3.12.
        stopwords_pattern = r'\b|\b'.join(re.escape(w) for w in stopwords_set)
        stopwords_regex = re.compile(rf'\b{stopwords_pattern}\b')
        s = s.str.replace(stopwords_regex, '', regex=True)

    duplicated = (s.duplicated()) & (s.notnull())
    seq_matcher = SequenceMatcher()

    if verbose:
        similar_list = []
    else:
        similar_list = None

    for i, text in tqdm(s.items(), total=len(s), desc="Deduplicating"):
        seq_matcher.set_seq2(text)

        for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
            seq_matcher.set_seq1(t)

            if seq_matcher.real_quick_ratio() > similarity and \
                    seq_matcher.quick_ratio() > similarity and \
                    (not strict_similarity or seq_matcher.ratio() > similarity):

                if verbose and not duplicated[j]:
                    similar_list.append((i, j))

                duplicated[j] = True

    if verbose:
        _print_similar_list(similar_list, data)

    asdata.df = asdata.df[~duplicated].reset_index(drop=True)
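
For readers who want to try the comparison strategy in isolation, here is a minimal stdlib-only sketch of the same idea: normalize the text, skip pairs whose lengths differ by 5 or more, then apply the cheap `SequenceMatcher` upper bounds before the exact ratio. The names `normalize`, `find_similar`, and `max_len_diff` are hypothetical and not part of this PR. The sketch leaves out `ftfy`, pandas, stopword removal, and the separate exact-duplicate pass, so it approximates the function above rather than reproducing it.

```python
import re
from difflib import SequenceMatcher

# Same character classes as the diff: keep spaces, word characters,
# hyphens and underscores, then collapse runs of whitespace.
SYMBOLS = re.compile(r"[^ \w\d\-_]")
SPACES = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Hypothetical helper: stdlib-only stand-in for the cleanup chain."""
    text = SYMBOLS.sub("", text)
    text = SPACES.sub(" ", text)
    return text.lower().strip()


def find_similar(texts: list[str], threshold: float = 0.98,
                 strict: bool = False, max_len_diff: int = 5) -> list[int]:
    """Hypothetical helper: indices of entries flagged as near-duplicates."""
    cleaned = [normalize(t) for t in texts]
    flagged: set[int] = set()
    matcher = SequenceMatcher()

    for i, a in enumerate(cleaned):
        # difflib caches details about seq2, so set it once per outer entry.
        matcher.set_seq2(a)
        for j in range(i + 1, len(cleaned)):
            b = cleaned[j]
            # Cheap prefilter: skip pairs whose lengths differ too much.
            if abs(len(b) - len(a)) >= max_len_diff:
                continue
            matcher.set_seq1(b)
            # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
            # so a pair that fails either can never pass the exact check.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and (not strict or matcher.ratio() > threshold)):
                flagged.add(j)

    return sorted(flagged)


titles = [
    "Deep learning for systematic reviews",
    "Deep  Learning for Systematic Reviews!",
    "Deep learning for systematic review",
    "Active learning in medicine",
]
duplicates = find_similar(titles, strict=True)
print(duplicates)  # [1, 2]
```

With `strict=False` only the two upper bounds are checked. That is faster, but it can flag pairs whose exact ratio falls below the threshold, which is the trade-off `strict_similarity` controls in the diff.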