Release Highlights
Release tag: v2.1.0
This release includes a single configuration change that allows deploys to specify the location of the queues that hold input for the crawler.
Upgrade Notes
No Action Required.
Optionally, you can set any of the new environment variables. See sections below that describe how these are used.
- CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS (default 0) - The default is the same as the hardcoded value in the previous release. Recommended value is 5.
- CRAWLER_LICENSEE_PARALLELISM (default 10) - The default is the same as the hardcoded value in the previous release.
- CRAWLER_AZBLOB_IS_SPN_AUTH (default false) - Boolean that must be true if using SPN auth.
- CRAWLER_AZBLOB_SPN_AUTH (default '') - The Azure SPN for Azblob. Only used if CRAWLER_AZBLOB_IS_SPN_AUTH is true.
- CRAWLER_QUEUE_AZURE_IS_SPN_AUTH (default false) - Boolean that must be true if using SPN auth.
- CRAWLER_QUEUE_AZURE_SPN_AUTH (default '') - The Azure SPN for the queue account. Only used if CRAWLER_QUEUE_AZURE_IS_SPN_AUTH is true.
What’s changed
Changes: v2.0.1..v2.1.0
Minor Changes
Add support for visibility timeout in AzureQueueStore
- Add support for visibility timeout in AzureQueueStore (#639) @qtomlinson
- Updated the default of CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS (#642) @qtomlinson
The crawler runs 4 tools, each of which generates tool results for package-version coordinates. As each tool completes, a message identifying the coordinates and tool is placed on the harvest queue. The service computes the definition if the current definition does not already have the results for that tool (at the tool’s version). If the tools finish at different times, the service can end up computing the definition for the same coordinates up to 4 times.
The addition of a visibility timeout for the AzureQueueStore allows messages to be hidden for a specified duration after being pushed onto the harvest queue. This allows for more of the crawler results to complete before the service computes the definition. This has the potential to reduce the number of definition computes from 4 down to 1. This enhancement works in conjunction with the improved definition computation on the service side, reducing the number of definition computations when component harvest results are available.
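The effect of the timeout on the number of computes can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not crawler or service code: the tool names, completion times, and the countDefinitionComputes helper are hypothetical, and a compute is assumed to pick up every tool result harvested by the time it runs.

```javascript
// Hypothetical simulation of harvest-queue messages; not crawler or service code.
function countDefinitionComputes(completionTimes, visibilityTimeoutSeconds) {
  // Each tool's message becomes visible only after the timeout has elapsed.
  const messages = Object.entries(completionTimes)
    .map(([tool, completedAt]) => ({ tool, visibleAt: completedAt + visibilityTimeoutSeconds }))
    .sort((a, b) => a.visibleAt - b.visibleAt)

  const toolsInDefinition = new Set()
  let computes = 0
  for (const { tool, visibleAt } of messages) {
    // No compute is needed when the definition already has this tool's results.
    if (toolsInDefinition.has(tool)) continue
    computes++
    // Assumption: a compute includes every tool result harvested so far.
    for (const [name, completedAt] of Object.entries(completionTimes)) {
      if (completedAt <= visibleAt) toolsInDefinition.add(name)
    }
  }
  return computes
}

// Four hypothetical tools finishing one minute apart (times in seconds).
const completionTimes = { toolA: 0, toolB: 60, toolC: 120, toolD: 180 }
console.log(countDefinitionComputes(completionTimes, 0)) // prints 4
console.log(countDefinitionComputes(completionTimes, 300)) // prints 1
```

With no delay, every message triggers its own compute. With a 300 second delay, all four results are already available when the first message becomes visible.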
The visibility timeout for the queue store is controlled by the CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS environment variable. The current default is set to 300 seconds (5 minutes). To maintain the existing behavior of not hiding messages upon adding them to the queue, set CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS to 0.
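As a rough sketch of how such a setting might be consumed, the helper below turns the environment variable into send options. The function name harvestQueueSendOptions and the fallbackSeconds parameter are hypothetical and not taken from the crawler source. The returned object is shaped like the options accepted by QueueClient.sendMessage in @azure/storage-queue, which takes a visibilityTimeout in seconds; how AzureQueueStore actually wires this up is an assumption.

```javascript
// Hypothetical helper; the crawler's real parsing and defaults may differ.
function harvestQueueSendOptions(env, fallbackSeconds) {
  const raw = env.CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS
  const parsed = Number.parseInt(raw, 10)
  // Fall back when the variable is unset, non-numeric, or negative.
  const seconds = Number.isInteger(parsed) && parsed >= 0 ? parsed : fallbackSeconds
  // 0 means the message is visible as soon as it is queued, so no option is set.
  return seconds > 0 ? { visibilityTimeout: seconds } : {}
}

console.log(harvestQueueSendOptions({ CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS: '300' }, 0))
console.log(harvestQueueSendOptions({ CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS: '0' }, 300))
```

Passing the environment in as an argument, rather than reading process.env directly, keeps the helper easy to test.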
Support SPN Authentication while maintaining Connection String approach for backward compatibility
- Upgrade Azure Storage SDK to a modern version (#629) @RomanIakovlev, @ljones140
- Separate crawler queue connection from harvest (#633) @ljones140
This change is a prerequisite to enable token-based authentication for the Azure Storage operations (blobs and queues). It updates the SDK to a modern version that is currently supported. See Issue #622 for information on why this change was necessary.
Old approach:
Prior to this change, the connection string for both blobs and queues was used for authentication and was set in the environment variable CRAWLER_AZBLOB_CONNECTION_STRING. This approach is still allowed, but not recommended.
New approach:
The recommended best practice is to use SPN for authentication. As part of this work, authentication can now be configured separately for blob storage and queues.
To use SPN authentication for Azure Blob storage, set the following two environment variables:
- CRAWLER_AZBLOB_IS_SPN_AUTH (default false) - Boolean that must be true if using SPN auth.
- CRAWLER_AZBLOB_SPN_AUTH (default '') - The Azure SPN for Azblob. Only used if CRAWLER_AZBLOB_IS_SPN_AUTH is true.
To use SPN authentication for Azure Queue storage, set the following two environment variables:
- CRAWLER_QUEUE_AZURE_IS_SPN_AUTH (default false) - Boolean that must be true if using SPN auth.
- CRAWLER_QUEUE_AZURE_SPN_AUTH (default '') - The Azure SPN for the queue account. Only used if CRAWLER_QUEUE_AZURE_IS_SPN_AUTH is true.
You can use a different SPN for each account, or the same SPN for both.
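The selection between the two approaches can be sketched as follows. This is an illustration only: resolveStorageAuth and its return shape are hypothetical, the case-insensitive boolean parsing is an assumption, and rejecting an empty SPN value is a design choice of this sketch rather than documented crawler behavior. The format of the SPN value itself is not described here, so the example uses a placeholder.

```javascript
// Hypothetical helper showing the SPN / connection string decision.
function resolveStorageAuth(env, { isSpnAuthVar, spnAuthVar, connectionStringVar }) {
  // Assumption: the flag is the string "true" (any case) when SPN auth is wanted.
  const useSpn = String(env[isSpnAuthVar] || 'false').toLowerCase() === 'true'
  if (useSpn) {
    const spn = env[spnAuthVar] || ''
    // Sketch-level choice: fail fast instead of silently falling back.
    if (!spn) throw new Error(`${isSpnAuthVar} is true but ${spnAuthVar} is empty`)
    return { mode: 'spn', spn }
  }
  // Backward-compatible path: keep using the connection string.
  return { mode: 'connectionString', connectionString: env[connectionStringVar] || '' }
}

const blobVars = {
  isSpnAuthVar: 'CRAWLER_AZBLOB_IS_SPN_AUTH',
  spnAuthVar: 'CRAWLER_AZBLOB_SPN_AUTH',
  connectionStringVar: 'CRAWLER_AZBLOB_CONNECTION_STRING'
}
const blobAuth = resolveStorageAuth(
  { CRAWLER_AZBLOB_IS_SPN_AUTH: 'true', CRAWLER_AZBLOB_SPN_AUTH: 'example-spn-value' },
  blobVars
)
console.log(blobAuth.mode) // prints "spn"
```

The same helper could be called a second time with the CRAWLER_QUEUE_AZURE_* variable names to resolve the queue account independently.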
Update fetch file to centralize default headers (#620) @yashkohli88
Refactor all fetch functions into fetch.js and use the centralized user-agent header across fetch functions for various package managers.
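A minimal sketch of the idea, assuming a single shared headers object that every fetch helper merges into its request options. The names defaultHeaders and withDefaultHeaders and the user-agent value are hypothetical; the actual contents of fetch.js may look different.

```javascript
// Hypothetical user-agent value; the real one is defined in the crawler's fetch.js.
const defaultHeaders = Object.freeze({ 'User-Agent': 'example-crawler/0.0.0' })

// Merge the shared defaults into per-request options.
function withDefaultHeaders(options = {}) {
  // Headers supplied by the caller override the defaults with the same name.
  return { ...options, headers: { ...defaultHeaders, ...(options.headers || {}) } }
}

const requestOptions = withDefaultHeaders({ method: 'GET', headers: { Accept: 'application/json' } })
console.log(requestOptions.headers)
```

Keeping the defaults in one place means a change to the user-agent applies to every package manager fetch at once.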
Persist manifest information for sourcearchive components (#625) @qtomlinson
Update Tool Version for mavenExtract and nugetExtract (#631) @yashkohli88
Tool versions are updated to 1.3.1 for mavenExtract and 1.2.3 for nugetExtract.
Make licensee's max degree of parallelism configurable (#641) @jkbschmid
The max degree of parallelism for the licensee tool was hardcoded to 10. This adds the CRAWLER_LICENSEE_PARALLELISM environment variable to allow for more control of crawler behavior. The default is 10 to maintain backward compatibility.
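To illustrate what a configurable degree of parallelism means in practice, the sketch below reads the variable and caps the number of concurrent workers. Both licenseeParallelism and mapWithLimit are hypothetical helpers written for this example; the crawler's licensee processor may bound concurrency in a different way.

```javascript
// Hypothetical helper: read the setting, falling back to the previous hardcoded value.
function licenseeParallelism(env) {
  const parsed = Number.parseInt(env.CRAWLER_LICENSEE_PARALLELISM, 10)
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 10
}

// Run worker over items with at most `limit` calls in flight at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  async function run() {
    // Each runner pulls the next unclaimed index until none remain.
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index])
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run)
  await Promise.all(runners)
  return results
}

const limit = licenseeParallelism({ CRAWLER_LICENSEE_PARALLELISM: '4' })
mapWithLimit(['a', 'b', 'c'], limit, async name => name.toUpperCase()).then(result => console.log(result))
```

A lower value reduces load on the host at the cost of longer runs, while a higher value does the opposite.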
Patch Updates
- Use the latest @clearlydefined/spdx 0.1.10 (#643) @qtomlinson