Description
There are performance issues when doing package assembly from scancode.io; see aboutcode-org/scancode.io#1398 for more information.
For example, if we run inspect_package on the following archive:
fontawesome-in-libs-dir-perf-test.tar.xz.txt
it hangs while processing and assigning resources to the package pkg:pypi/fontawesomefree@6.6.0, which is created from the datafile fontawesome-in-libs-dir-perf-test.tar.xz-extract/fontawesome-in-libs-dir-test/.venv/lib/python3.10/site-packages/fontawesomefree-6.6.0.dist-info/METADATA, because the package includes a large number of svg/css files, all in one huge directory.
This takes about 3 hours to complete, with ~8k resources and 3 packages.
Another example is scanning installed wheel metadata for licensedcode together with the licensedcode data directories: one directory holds ~40k rules, and the scan takes tens of hours to complete.
This is due to how the PythonInstalledWheelMetadataFile.assign_package_to_resources function is implemented, possibly because of the line at https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L479, which builds a list from all the child resources of a directory just to check whether there is more than one resource and to get a single resource.
This operation is repeated multiple times for each resource in the directory, and each query is a huge one, which causes the performance degradation.
Sometimes we even hit the timeout on a single query:
scancodeio-worker-1 | 2025-02-26T09:04:23.192495227Z INFO Selected package handler: PythonInstalledWheelMetadataFile
scancodeio-worker-1 | 2025-02-26T09:04:23.260150176Z INFO Processing item: Package(type='pypi', namespace=None, name='fontawesomefree', version='6.6.0', datasource_id='pypi_wheel_metadata')
scancodeio-worker-1 | 2025-02-27T00:08:13.644525931Z Job 7081fa32-8a5a-4221-b369-1cb22abad6d4: error while executing failure callback
scancodeio-worker-1 | 2025-02-27T00:08:13.644540243Z Traceback (most recent call last):
scancodeio-worker-1 | 2025-02-27T00:08:13.644542165Z File "/opt/scancodeio/aboutcode/pipeline/__init__.py", line 199, in execute
scancodeio-worker-1 | 2025-02-27T00:08:13.644543863Z step(self)
scancodeio-worker-1 | 2025-02-27T00:08:13.644545394Z File "/opt/scancodeio/scanpipe/pipelines/root_filesystem.py", line 96, in scan_for_application_packages
scancodeio-worker-1 | 2025-02-27T00:08:13.644547199Z scancode.scan_for_application_packages(self.project, progress_logger=self.log)
scancodeio-worker-1 | 2025-02-27T00:08:13.644548642Z File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 443, in scan_for_application_packages
scancodeio-worker-1 | 2025-02-27T00:08:13.644550060Z assemble_packages(project=project, progress_logger=progress_logger)
scancodeio-worker-1 | 2025-02-27T00:08:13.644551479Z File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 490, in assemble_packages
scancodeio-worker-1 | 2025-02-27T00:08:13.644552875Z assemble_package(resource, project, processed_paths)
scancodeio-worker-1 | 2025-02-27T00:08:13.644554259Z File "/opt/scancodeio/scanpipe/pipes/scancode.py", line 514, in assemble_package
scancodeio-worker-1 | 2025-02-27T00:08:13.644556763Z for item in extracted_items:
scancodeio-worker-1 | 2025-02-27T00:08:13.644558199Z ^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644559559Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/models.py", line 1185, in assemble
scancodeio-worker-1 | 2025-02-27T00:08:13.644561057Z cls.assign_package_to_resources(
scancodeio-worker-1 | 2025-02-27T00:08:13.644562432Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/pypi.py", line 431, in assign_package_to_resources
scancodeio-worker-1 | 2025-02-27T00:08:13.644563893Z ref_resource = get_resource_for_path(
scancodeio-worker-1 | 2025-02-27T00:08:13.644565220Z ^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644566588Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/packagedcode/pypi.py", line 479, in get_resource_for_path
scancodeio-worker-1 | 2025-02-27T00:08:13.644568045Z children = [c for c in root.children(codebase) if c.name == seg]
scancodeio-worker-1 | 2025-02-27T00:08:13.644584150Z ^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644586732Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 400, in __iter__
scancodeio-worker-1 | 2025-02-27T00:08:13.644589265Z self._fetch_all()
scancodeio-worker-1 | 2025-02-27T00:08:13.644591119Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1928, in _fetch_all
scancodeio-worker-1 | 2025-02-27T00:08:13.644599461Z self._result_cache = list(self._iterable_class(self))
scancodeio-worker-1 | 2025-02-27T00:08:13.644601020Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644602359Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 91, in __iter__
scancodeio-worker-1 | 2025-02-27T00:08:13.644603821Z results = compiler.execute_sql(
scancodeio-worker-1 | 2025-02-27T00:08:13.644605194Z ^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644607170Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1574, in execute_sql
scancodeio-worker-1 | 2025-02-27T00:08:13.644608695Z cursor.execute(sql, params)
scancodeio-worker-1 | 2025-02-27T00:08:13.644610030Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
scancodeio-worker-1 | 2025-02-27T00:08:13.644611445Z return self._execute_with_wrappers(
scancodeio-worker-1 | 2025-02-27T00:08:13.644612741Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644614029Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
scancodeio-worker-1 | 2025-02-27T00:08:13.644615467Z return executor(sql, params, many, context)
scancodeio-worker-1 | 2025-02-27T00:08:13.644616865Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644618291Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
scancodeio-worker-1 | 2025-02-27T00:08:13.644619755Z return self.cursor.execute(sql, params)
scancodeio-worker-1 | 2025-02-27T00:08:13.644621095Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644622376Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/cursor.py", line 93, in execute
scancodeio-worker-1 | 2025-02-27T00:08:13.644623770Z self._conn.wait(
scancodeio-worker-1 | 2025-02-27T00:08:13.644625104Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/connection.py", line 409, in wait
scancodeio-worker-1 | 2025-02-27T00:08:13.644626515Z return waiting.wait(gen, self.pgconn.socket, interval=interval)
scancodeio-worker-1 | 2025-02-27T00:08:13.644627964Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scancodeio-worker-1 | 2025-02-27T00:08:13.644629352Z File "psycopg_binary/_psycopg/waiting.pyx", line 201, in psycopg_binary._psycopg.wait_c
scancodeio-worker-1 | 2025-02-27T00:08:13.644630831Z File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/timeouts.py", line 63, in handle_death_penalty
scancodeio-worker-1 | 2025-02-27T00:08:13.644632733Z raise self._exception('Task exceeded maximum timeout value ({0} seconds)'.format(self._timeout))
scancodeio-worker-1 | 2025-02-27T00:08:13.644634107Z rq.timeouts.JobTimeoutException: Task exceeded maximum timeout value (86400 seconds)
This also affects Docker image scans with large Python installations (such as PyTorch-based Docker images).
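To illustrate the cost pattern described above, here is a minimal sketch in plain Python. It is not the scancode-toolkit or ScanCode.io code: `Resource`, `find_child_by_scan` and `ChildIndex` are hypothetical names, and the toy `children()` method only stands in for the expensive listing (a database query in the traceback). It contrasts re-listing a directory on every lookup with listing it once and serving later lookups from a dict keyed by child name. The "exactly one match" rule is an assumption modelled on the description above.

```python
from collections import defaultdict


class Resource:
    """Toy in-memory stand-in for a codebase resource (hypothetical)."""

    def __init__(self, name, children=None):
        self.name = name
        self._children = list(children or [])
        # Counts how many times the full child listing was requested.
        self.children_calls = 0

    def children(self):
        self.children_calls += 1
        return iter(self._children)


def find_child_by_scan(parent, name):
    """Re-list every child of `parent` on each lookup (the slow pattern)."""
    matches = [c for c in parent.children() if c.name == name]
    return matches[0] if len(matches) == 1 else None


class ChildIndex:
    """List each directory once, then answer lookups from a name index."""

    def __init__(self):
        self._by_parent = {}

    def find_child(self, parent, name):
        index = self._by_parent.get(parent)
        if index is None:
            # First lookup in this directory: one full listing, grouped by name.
            index = defaultdict(list)
            for child in parent.children():
                index[child.name].append(child)
            self._by_parent[parent] = index
        matches = index.get(name, [])
        return matches[0] if len(matches) == 1 else None


# Demo: 1000 lookups in one large directory.
big_dir = Resource("svgs", [Resource(f"icon-{i}.svg") for i in range(1000)])
index = ChildIndex()
for i in range(1000):
    index.find_child(big_dir, f"icon-{i}.svg")
print("full listings with index:", big_dir.children_calls)
```

With the index, the directory is listed once no matter how many paths are resolved inside it; with the scan version, every lookup lists it again. This only holds if the directory contents do not change during assembly, which is an assumption here.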