Package assembly is extremely slow for PythonInstalledWheelMetadataFile handler with many package resources #4167

@AyanSinhaMahapatra

Package assembly has performance issues when run from scancode.io; see the related issue aboutcode-org/scancode.io#1398 for more information.

For example, if we run inspect_package on the following archive:
fontawesome-in-libs-dir-perf-test.tar.xz.txt

It appears to hang while processing and assigning resources to the package pkg:pypi/fontawesomefree@6.6.0, which is created from the datafile fontawesome-in-libs-dir-perf-test.tar.xz-extract/fontawesome-in-libs-dir-test/.venv/lib/python3.10/site-packages/fontawesomefree-6.6.0.dist-info/METADATA, because the package includes a large number of svg/css files, all in one huge directory.
Processing takes about 3 hours to complete, with ~8k resources and 3 packages.

Similarly, scanning the installed wheel metadata for licensedcode together with the licensedcode data directories (one directory holds ~40k rules) takes tens of hours to complete.

This is due to how the PythonInstalledWheelMetadataFile.assign_package_to_resources function is implemented, possibly because of this line at https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L479, which iterates over all the children resources of a directory to build a list, just to check whether there is more than one resource and to get a single one.
This operation is repeated for each resource in the directory, and each repetition issues a huge query, which causes the performance degradation.
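To illustrate the cost pattern described above, here is a minimal, self-contained sketch. It does not use the scancode-toolkit or scancode.io classes: `Node`, `find_child_by_scan`, `build_index`, and `find_child_by_index` are hypothetical names, and the `scans` counter stands in for one expensive "fetch all children" query. The "return None unless exactly one child matches" behavior is also an assumption. The sketch compares enumerating every child on each lookup against enumerating once and indexing by name, which is one possible direction for a fix, not necessarily the one the maintainers would choose.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a codebase resource (not the scancode class)."""
    name: str
    kids: list = field(default_factory=list)
    scans: int = 0  # number of times all children were enumerated

    def children(self):
        # Stands in for one expensive "fetch all children" query.
        self.scans += 1
        return iter(self.kids)


def find_child_by_scan(root, seg):
    """Enumerate every child on each lookup, as in the reported pattern."""
    children = [c for c in root.children() if c.name == seg]
    # Assumption: only an unambiguous single match is accepted.
    return children[0] if len(children) == 1 else None


def build_index(root):
    """Enumerate the children once and group them by name."""
    index = {}
    for child in root.children():
        index.setdefault(child.name, []).append(child)
    return index


def find_child_by_index(index, seg):
    """Dictionary lookup with the same single-match rule."""
    matches = index.get(seg, [])
    return matches[0] if len(matches) == 1 else None


# One directory with many files, similar in shape to the fontawesome case.
big_dir = Node("svgs", [Node(f"icon-{i}.svg") for i in range(2000)])
names = [k.name for k in big_dir.kids]

for n in names:
    assert find_child_by_scan(big_dir, n).name == n
scans_naive = big_dir.scans  # one full enumeration per lookup

big_dir.scans = 0
index = build_index(big_dir)
for n in names:
    assert find_child_by_index(index, n).name == n
scans_indexed = big_dir.scans  # a single enumeration in total
```

With N files in one directory, the first approach enumerates the directory N times (N × N children visited in total), while the indexed approach enumerates it once. When `children()` is backed by a database query, as the traceback suggests for scancode.io, each enumeration is a full query, so the difference grows quickly with directory size.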

Sometimes we even hit the timeout on a single query:

scancodeio-worker-1  | 2025-02-26T09:04:23.192495227Z INFO   Selected package handler: PythonInstalledWheelMetadataFile
scancodeio-worker-1  | 2025-02-26T09:04:23.260150176Z INFO     Processing item: Package(type='pypi', namespace=None, name='fontawesomefree', version='6.6.0', datasource_id='pypi_wheel_metadata')
scancodeio-worker-1  | 2025-02-27T00:08:13.644525931Z Job 7081fa32-8a5a-4221-b369-1cb22abad6d4: error while executing failure callback
scancodeio-worker-1  | 2025-02-27T00:08:13.644540243Z Traceback (most recent call last):
scancodeio-worker-1  | 2025-02-27T00:08:13.644542165Z   File "/opt/scancodeio/aboutcode/pipeline/__init__.py", line 199, in execute
scancodeio-worker-1  | 2025-02-27T00:08:13.644543863Z     step(self)
scancodeio-worker-1  | 2025-02-27T00:08:13.644545394Z   File "/opt/scancodeio/scanpipe/pipelines/root_filesystem.py", line 96, in scan_for_application_packages
scancodeio-worker-1  | 2025-02-27T00:08:13.644547199Z     scancode.scan_for_application_packages(self.project, progress_logger=self.log)
scancodeio-worker-1  | 2025-02-27T00:08:13.644548642Z   File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 443, in scan_for_application_packages
scancodeio-worker-1  | 2025-02-27T00:08:13.644550060Z     assemble_packages(project=project, progress_logger=progress_logger)
scancodeio-worker-1  | 2025-02-27T00:08:13.644551479Z   File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 490, in assemble_packages
scancodeio-worker-1  | 2025-02-27T00:08:13.644552875Z     assemble_package(resource, project, processed_paths)
scancodeio-worker-1  | 2025-02-27T00:08:13.644554259Z   File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 514, in assemble_package
scancodeio-worker-1  | 2025-02-27T00:08:13.644556763Z     for item in extracted_items:
scancodeio-worker-1  | 2025-02-27T00:08:13.644558199Z                 ^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644559559Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/models.py", line 1185, in assemble
scancodeio-worker-1  | 2025-02-27T00:08:13.644561057Z     cls.assign_package_to_resources(
scancodeio-worker-1  | 2025-02-27T00:08:13.644562432Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/pypi.py", line 431, in assign_package_to_resources
scancodeio-worker-1  | 2025-02-27T00:08:13.644563893Z     ref_resource = get_resource_for_path(
scancodeio-worker-1  | 2025-02-27T00:08:13.644565220Z                    ^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644566588Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/pypi.py", line 479, in get_resource_for_path
scancodeio-worker-1  | 2025-02-27T00:08:13.644568045Z     children = [c for c in root.children(codebase) if c.name == seg]
scancodeio-worker-1  | 2025-02-27T00:08:13.644584150Z                            ^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644586732Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 400, in __iter__
scancodeio-worker-1  | 2025-02-27T00:08:13.644589265Z     self._fetch_all()
scancodeio-worker-1  | 2025-02-27T00:08:13.644591119Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1928, in _fetch_all
scancodeio-worker-1  | 2025-02-27T00:08:13.644599461Z     self._result_cache = list(self._iterable_class(self))
scancodeio-worker-1  | 2025-02-27T00:08:13.644601020Z                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644602359Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 91, in __iter__
scancodeio-worker-1  | 2025-02-27T00:08:13.644603821Z     results = compiler.execute_sql(
scancodeio-worker-1  | 2025-02-27T00:08:13.644605194Z               ^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644607170Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1574, in execute_sql
scancodeio-worker-1  | 2025-02-27T00:08:13.644608695Z     cursor.execute(sql, params)
scancodeio-worker-1  | 2025-02-27T00:08:13.644610030Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
scancodeio-worker-1  | 2025-02-27T00:08:13.644611445Z     return self._execute_with_wrappers(
scancodeio-worker-1  | 2025-02-27T00:08:13.644612741Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644614029Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
scancodeio-worker-1  | 2025-02-27T00:08:13.644615467Z     return executor(sql, params, many, context)
scancodeio-worker-1  | 2025-02-27T00:08:13.644616865Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644618291Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
scancodeio-worker-1  | 2025-02-27T00:08:13.644619755Z     return self.cursor.execute(sql, params)
scancodeio-worker-1  | 2025-02-27T00:08:13.644621095Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644622376Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/cursor.py", line 93, in execute
scancodeio-worker-1  | 2025-02-27T00:08:13.644623770Z     self._conn.wait(
scancodeio-worker-1  | 2025-02-27T00:08:13.644625104Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/connection.py", line 409, in wait
scancodeio-worker-1  | 2025-02-27T00:08:13.644626515Z     return waiting.wait(gen, self.pgconn.socket, interval=interval)
scancodeio-worker-1  | 2025-02-27T00:08:13.644627964Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1  | 2025-02-27T00:08:13.644629352Z   File "psycopg_binary/_psycopg/waiting.pyx", line 201, in psycopg_binary._psycopg.wait_c
scancodeio-worker-1  | 2025-02-27T00:08:13.644630831Z   File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/timeouts.py", line 63, in handle_death_penalty
scancodeio-worker-1  | 2025-02-27T00:08:13.644632733Z     raise self._exception('Task exceeded maximum timeout value ({0} seconds)'.format(self._timeout))
scancodeio-worker-1  | 2025-02-27T00:08:13.644634107Z rq.timeouts.JobTimeoutException: Task exceeded maximum timeout value (86400 seconds)

This also affects Docker image scans with large Python installations (such as PyTorch-based Docker images).
