
Conversation

@valentijnscholten (Member) commented Oct 21, 2025

Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.

For larger imports, however, performance is poor and resource usage is (very) high. An import of 1000+ findings can cause a Celery worker to spend minutes on deduplication.

This PR changes the deduplication process for import and reimport to run in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings) instead of 1 query per finding (1000 queries).
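To make the "one query per batch" idea concrete, here is a minimal sketch under assumed, simplified lookups; it is not the PR's actual code, although `hash_code` is a real `Finding` field:

```python
# Sketch of batched candidate fetching: one __in query for the whole batch
# instead of one query per finding. Filter/exclude details are assumptions.
from collections import defaultdict

from dojo.models import Finding


def find_candidates_batched(new_findings, product):
    # Single database query covering every hash code in the batch.
    hash_codes = {f.hash_code for f in new_findings}
    candidates = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    # Group candidates by hash_code so each new finding can be matched in memory.
    by_hash = defaultdict(list)
    for candidate in candidates:
        by_hash[candidate.hash_code].append(candidate)
    return by_hash
```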

During the development of this PR I realized:

Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication, the deduplication result of the first finding in a report can have an effect on the deduplication result of subsequent findings (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.

The PR splits the deduplication process into three parts:

  1. Find possible candidates
  2. Match the (new) finding against the candidates
  3. Act on it if a match is found

One of the reasons for this split is that we want to use the exact same matching logic for the reimport process. Reimport currently has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Besides the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication from this PR.
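For illustration, a rough outline of how the three phases could fit together, including the intra-report corner case mentioned above. All names are illustrative, not the actual DefectDojo API, and the matching logic is deliberately trivial:

```python
# Hypothetical outline of the three phases on one batch of findings.
def deduplicate_batch(new_findings, candidates_by_hash):
    # candidates_by_hash: dict mapping hash_code -> list of existing findings,
    # pre-fetched with a single query (phase 1).
    for finding in new_findings:
        candidates = candidates_by_hash.get(finding.hash_code, [])
        # Phase 2: match the new finding against its candidates
        # (the real matching logic is richer than "first candidate wins").
        match = candidates[0] if candidates else None
        if match:
            # Phase 3: act on the match.
            finding.duplicate = True
            finding.duplicate_finding = match
        else:
            # A non-duplicate becomes a candidate for later findings in the
            # same report, preserving the one-by-one behavior for
            # intra-report duplicates.
            candidates_by_hash.setdefault(finding.hash_code, []).append(finding)
```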

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows the obvious huge improvement in deduplication time. Please note that we're not doing this only for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

initial import (no duplicates):

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~12s | ~200s |

second import into the same product (all duplicates):

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~180s | ~370s |

Imagine what this can do for reimport performance if we switch that to batch mode.

@github-actions bot added the settings_changes label (Needs changes to settings.py based on changes in settings.dist.py included in this PR) Nov 8, 2025
@valentijnscholten added the affects_pro label (PRs that affect Pro and need a coordinated release/merge moment) Nov 12, 2025
@mtesauro (Contributor) left a comment:

Approved

@dogboat (Contributor) left a comment:

It looks good, just had one question.

@valentijnscholten marked this pull request as draft November 13, 2025 21:57
@valentijnscholten marked this pull request as ready for review November 14, 2025 17:00
@dryrunsecurity (DryRun Security) commented:

This pull request introduces three security concerns: a potential arbitrary code execution path from dynamically loading a hash-compute method via an environment-controllable setting; a denial-of-service risk from an unvalidated, small batch-size setting that can cause many Celery tasks to be spawned; and an unauthorized queue-purge risk, where the management command accepts arbitrary queue names (and the confirmation can be bypassed), allowing destructive purges of critical Celery queues.

**Arbitrary Code Execution via Custom Method Loading** in `dojo/models.py`

Description: The application dynamically loads a method for computing a hash code using `get_custom_method("FINDING_COMPUTE_HASH_METHOD")`. Based on the observed pattern in `dojo/settings/settings.dist.py`, where settings are loaded from environment variables using `env()`, it is highly probable that the `FINDING_COMPUTE_HASH_METHOD` setting can be controlled via an environment variable (e.g., `DD_FINDING_COMPUTE_HASH_METHOD`). If an attacker can control this environment variable, they can inject an arbitrary module path and function name (e.g., `os.system`). When `get_custom_method` resolves this string into a callable and `compute_hash_code_method(self)` is invoked, this leads to arbitrary code execution.

```python
if compute_hash_code_method := get_custom_method("FINDING_COMPUTE_HASH_METHOD"):
    deduplicationLogger.debug("using custom FINDING_COMPUTE_HASH_METHOD method")
    return compute_hash_code_method(self)
```
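One possible mitigation, sketched here as an assumption rather than anything in the PR, is to resolve the setting only against an explicit allowlist of dotted paths (the allowlist entry below is hypothetical):

```python
# Hardening sketch: only resolve custom methods that appear on an allowlist.
from importlib import import_module

from django.conf import settings
from django.core.exceptions import ImproperlyConfigured

ALLOWED_HASH_METHODS = {"dojo.utils.default_compute_hash_code"}  # hypothetical entry


def get_custom_method_safe(setting_name):
    dotted_path = getattr(settings, setting_name, None)
    if not dotted_path:
        return None
    if dotted_path not in ALLOWED_HASH_METHODS:
        raise ImproperlyConfigured(
            f"{setting_name} must be one of {sorted(ALLOWED_HASH_METHODS)}",
        )
    module_path, _, func_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), func_name)
```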

**Denial of Service via Misconfiguration** in `dojo/importers/default_importer.py`

Description: The `DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE` setting, which controls the batch size for asynchronous deduplication tasks, lacks validation for a minimum value. An administrator can configure this setting to a very low number (e.g., 1), causing the system to dispatch an excessive number of small Celery tasks during large imports or reimports. This can overwhelm the message broker and workers, leading to resource exhaustion and a denial of service for background processing, since each finding would result in a separate Celery task being dispatched, incurring significant overhead.

```python
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
```
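A minimal sketch of the missing floor check (the constant and its value are assumptions, not from the PR):

```python
# Clamp the configured batch size so a misconfigured value such as 1 cannot
# flood the broker with thousands of tiny Celery tasks.
from django.conf import settings

MIN_DEDUPE_BATCH_SIZE = 100  # hypothetical floor

batch_max_size = max(
    MIN_DEDUPE_BATCH_SIZE,
    getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000),
)
```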

**Unauthorized Celery Queue Purge (Denial of Service)** in `dojo/management/commands/clear_celery_queue.py`

Description: The `clear_celery_queue` management command allows an attacker who can execute Django management commands to specify an arbitrary Celery queue name via the `--queue` argument. This input is not validated or sanitized against an allowlist of safe-to-purge queues. Consequently, an attacker could purge critical application queues (e.g., `celery`, or queues handling deduplication/post-processing tasks), leading to a denial of service by halting essential background processes, causing data inconsistencies, and disrupting application functionality. While the command includes a confirmation prompt, it can be bypassed with the `--force` flag, making the operation destructive without user interaction.

```python
purged_count = channel.queue_purge(queue=queue)
total_purged += purged_count
self.stdout.write(
    self.style.SUCCESS(f" ✓ Purged {purged_count} messages from queue: {queue}"),
)
```
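A hedged sketch of such an allowlist guard (the queue names below are hypothetical, not DefectDojo's actual queues):

```python
# Refuse to purge queues outside a known-safe set, even when --force is given.
from django.core.management.base import CommandError

SAFE_TO_PURGE = {"dedupe", "post_processing"}  # hypothetical queue names


def purge_queue_safely(channel, queue):
    if queue not in SAFE_TO_PURGE:
        raise CommandError(
            f"Refusing to purge queue {queue!r}; allowed: {sorted(SAFE_TO_PURGE)}",
        )
    return channel.queue_purge(queue=queue)
```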


All finding details can be found in the DryRun Security Dashboard.

@valentijnscholten merged commit 68f6639 into DefectDojo:dev Nov 14, 2025
150 checks passed