
Conversation

@valentijnscholten (Member) commented Oct 21, 2025

Traditionally, Defect Dojo has deduplicated (new) findings one by one. This works well for small imports and has the benefit of an easy-to-understand codebase and test suite.

For larger imports, however, performance is poor and resource usage is (very) high. An import of 1000+ findings can cause a Celery worker to spend minutes on deduplication.

This PR changes the deduplication process for import and reimport to run in batches. The biggest benefit is that there will now be 1 database query per batch (1000 findings) instead of 1 query per finding (1000 queries).
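To make the "one query per batch" idea concrete, here is a minimal sketch under assumed, simplified lookups; it is not the PR's actual code, although `hash_code` is a real `Finding` field:

```python
# Sketch of batched candidate fetching: one __in query for the whole batch
# instead of one query per finding. Filter/exclude details are assumptions.
from collections import defaultdict

from dojo.models import Finding


def find_candidates_batched(new_findings, product):
    # Single database query covering every hash code in the batch.
    hash_codes = {f.hash_code for f in new_findings}
    candidates = Finding.objects.filter(
        test__engagement__product=product,
        hash_code__in=hash_codes,
    ).exclude(id__in=[f.id for f in new_findings])

    # Group candidates by hash_code so each new finding can be matched in memory.
    by_hash = defaultdict(list)
    for candidate in candidates:
        by_hash[candidate.hash_code].append(candidate)
    return by_hash
```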

During the development of this PR I realized:

Although batching dedupe sounds like a simple PR, the caveat is that with one-by-one deduplication, the deduplication result of the first finding in a report can have an effect on the deduplication result of subsequent findings (if there are duplicates inside the same report). This should be a corner case and usually means the deduplication configuration needs some fine-tuning. Nevertheless, we wanted to make sure not to cause unexpected/different behavior here. The new tests should cover this.

The PR splits the deduplication process into three parts:

  1. Find possible candidates
  2. Match the (new) finding against the candidates
  3. Act on it if a match is found

One of the reasons for this split is that we want to use the exact same matching logic for the reimport process. Reimport currently has an almost identical matching algorithm, but with minor unintentional differences. Once this PR has proven itself, we will adjust the reimport process. Besides the "reimport matching", the reimport process also deduplicates new findings; that part already uses the batchwise deduplication from this PR.
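For illustration, a rough outline of how the three phases could fit together, including the intra-report corner case mentioned above. All names are illustrative, not the actual DefectDojo API, and the matching logic is deliberately trivial:

```python
# Hypothetical outline of the three phases on one batch of findings.
def deduplicate_batch(new_findings, candidates_by_hash):
    # candidates_by_hash: dict mapping hash_code -> list of existing findings,
    # pre-fetched with a single query (phase 1).
    for finding in new_findings:
        candidates = candidates_by_hash.get(finding.hash_code, [])
        # Phase 2: match the new finding against its candidates
        # (the real matching logic is richer than "first candidate wins").
        match = candidates[0] if candidates else None
        if match:
            # Phase 3: act on the match.
            finding.duplicate = True
            finding.duplicate_finding = match
        else:
            # A non-duplicate becomes a candidate for later findings in the
            # same report, preserving the one-by-one behavior for
            # intra-report duplicates.
            candidates_by_hash.setdefault(finding.hash_code, []).append(finding)
```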

A quick test with the jfrog_xray_unified/very_many_vulns.json sample scan (10k findings) shows the obvious huge improvement in deduplication time. Please note that we're not doing this only for performance, but also to reduce the resources (cloud cost) needed to run Defect Dojo.

initial import (no duplicates):

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~12s | ~200s |

second import into the same product (all duplicates):

| branch | import time | dedupe time | total time |
| --- | --- | --- | --- |
| dev | ~200s | ~400s | ~600s |
| dedupe-batching | ~190s | ~180s | ~370s |

Imagine what this can do for reimport performance if we switch that to batch mode.

@github-actions bot added the settings_changes label (Needs changes to settings.py based on changes in settings.dist.py included in this PR) Nov 8, 2025
@valentijnscholten added the affects_pro label (PRs that affect Pro and need a coordinated release/merge moment) Nov 12, 2025
@mtesauro (Contributor) left a comment:

Approved

@dogboat (Contributor) left a comment:

It looks good, just had one question.

@valentijnscholten marked this pull request as draft November 13, 2025 21:57
@valentijnscholten marked this pull request as ready for review November 14, 2025 17:00
@dryrunsecurity (DryRun Security) commented:

This pull request introduces three security concerns: a potential arbitrary code execution path from dynamically loading a hash-compute method via an environment-controllable setting; a denial-of-service risk from an unvalidated, small batch-size setting that can cause many Celery tasks to be spawned; and an unauthorized queue-purge risk, where the management command accepts arbitrary queue names (and the confirmation can be bypassed), allowing destructive purges of critical Celery queues.

**Arbitrary Code Execution via Custom Method Loading** in `dojo/models.py`

Description: The application dynamically loads a method for computing a hash code using `get_custom_method("FINDING_COMPUTE_HASH_METHOD")`. Based on the observed pattern in `dojo/settings/settings.dist.py`, where settings are loaded from environment variables using `env()`, it is highly probable that the `FINDING_COMPUTE_HASH_METHOD` setting can be controlled via an environment variable (e.g., `DD_FINDING_COMPUTE_HASH_METHOD`). If an attacker can control this environment variable, they can inject an arbitrary module path and function name (e.g., `os.system`). When `get_custom_method` resolves this string into a callable and `compute_hash_code_method(self)` is invoked, this leads to arbitrary code execution.

```python
if compute_hash_code_method := get_custom_method("FINDING_COMPUTE_HASH_METHOD"):
    deduplicationLogger.debug("using custom FINDING_COMPUTE_HASH_METHOD method")
    return compute_hash_code_method(self)
```
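One possible mitigation, sketched here as an assumption rather than anything in the PR, is to resolve the setting only against an explicit allowlist of dotted paths (the allowlist entry below is hypothetical):

```python
# Hardening sketch: only resolve custom methods that appear on an allowlist.
from importlib import import_module

from django.conf import settings
from django.core.exceptions import ImproperlyConfigured

ALLOWED_HASH_METHODS = {"dojo.utils.default_compute_hash_code"}  # hypothetical entry


def get_custom_method_safe(setting_name):
    dotted_path = getattr(settings, setting_name, None)
    if not dotted_path:
        return None
    if dotted_path not in ALLOWED_HASH_METHODS:
        raise ImproperlyConfigured(
            f"{setting_name} must be one of {sorted(ALLOWED_HASH_METHODS)}",
        )
    module_path, _, func_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), func_name)
```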

**Denial of Service via Misconfiguration** in `dojo/importers/default_importer.py`

Description: The `DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE` setting, which controls the batch size for asynchronous deduplication tasks, lacks validation for a minimum value. An administrator can configure this setting to a very low number (e.g., 1), causing the system to dispatch an excessive number of small Celery tasks during large imports or reimports. This can overwhelm the message broker and workers, leading to resource exhaustion and a denial of service for background processing, since each finding would result in a separate Celery task being dispatched, incurring significant overhead.

```python
batch_max_size = getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000)
```
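A minimal sketch of the missing floor check (the constant and its value are assumptions, not from the PR):

```python
# Clamp the configured batch size so a misconfigured value such as 1 cannot
# flood the broker with thousands of tiny Celery tasks.
from django.conf import settings

MIN_DEDUPE_BATCH_SIZE = 100  # hypothetical floor

batch_max_size = max(
    MIN_DEDUPE_BATCH_SIZE,
    getattr(settings, "IMPORT_REIMPORT_DEDUPE_BATCH_SIZE", 1000),
)
```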

**Unauthorized Celery Queue Purge (Denial of Service)** in `dojo/management/commands/clear_celery_queue.py`

Description: The `clear_celery_queue` management command allows an attacker who can execute Django management commands to specify an arbitrary Celery queue name via the `--queue` argument. This input is not validated or sanitized against an allowlist of safe-to-purge queues. Consequently, an attacker could purge critical application queues (e.g., `celery`, or queues handling deduplication/post-processing tasks), leading to a denial of service by halting essential background processes, causing data inconsistencies, and disrupting application functionality. While the command includes a confirmation prompt, it can be bypassed with the `--force` flag, making the operation destructive without user interaction.

```python
purged_count = channel.queue_purge(queue=queue)
total_purged += purged_count
self.stdout.write(
    self.style.SUCCESS(f" ✓ Purged {purged_count} messages from queue: {queue}"),
)
```
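A hedged sketch of such an allowlist guard (the queue names below are hypothetical, not DefectDojo's actual queues):

```python
# Refuse to purge queues outside a known-safe set, even when --force is given.
from django.core.management.base import CommandError

SAFE_TO_PURGE = {"dedupe", "post_processing"}  # hypothetical queue names


def purge_queue_safely(channel, queue):
    if queue not in SAFE_TO_PURGE:
        raise CommandError(
            f"Refusing to purge queue {queue!r}; allowed: {sorted(SAFE_TO_PURGE)}",
        )
    return channel.queue_purge(queue=queue)
```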


All finding details can be found in the DryRun Security Dashboard.

@valentijnscholten merged commit 68f6639 into DefectDojo:dev Nov 14, 2025
150 checks passed