@bluegenes bluegenes commented Mar 10, 2025

Fixes:

  • NCBI may block IP address due to gbsketch #215

  • ideas for improving retrieval through NCBI REST API #216

  • suggestion: name output .zip files from batches .sig.zip instead of .zip #191

  • more output desired #174

  • gbsketch now uses the dehydrate -> rehydrate approach. It first uses the NCBI API to download a "dehydrated" file containing direct fetch links for all accessions. It then downloads the gzipped FASTA files from those links, checking along the way that each download completes properly.

  • Since it is not possible to get md5sums for the gzipped files from the dehydrated file, gbsketch (and only gbsketch) now relies on the internal checksums (crc32) in the gz files, which needletail checks automatically as it processes each file. To make sure we retry if something fails, I've moved the file parsing inside the retry logic: if needletail encounters an error with the file, we retry up to the maximum number of retries and log the final error if it still does not succeed. Unfortunately, since needletail is sync, we still need to download all data for a given file, THEN parse to write/sketch (i.e. we cannot parse while streaming).

  • Since we are using direct download links, we can enable a much larger number of simultaneous downloads. Here I've enabled 1-30, following the max used by the NCBI datasets application. I've set the new default for both gbsketch and urlsketch to 10, since limiting to 3 before was not actually necessary.

  • For missing protein files (files where NCBI does not have a download fetch link because they do not exist), we now write empty URLs in the failures file. To continue to allow running urlsketch from a gbsketch failures file, I've added --force to urlsketch to skip past entries with empty URLs. However, since we can re-run gbsketch from the gbsketch failure file, that is the preferred/recommended approach.
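The download flow described above (fetch the whole file, then parse it inside the retry loop, with a bounded number of simultaneous downloads) can be sketched roughly as follows. This is an illustrative Python sketch, not the plugin's actual code: `fetch`, `parse_gzipped_fasta`, `download_and_parse`, `download_all`, and the `MAX_RETRIES` value are hypothetical names and values chosen for the example. Python's `gzip.decompress` stands in for needletail here; like needletail, it validates the CRC32 stored in the gzip trailer.

```python
import asyncio
import gzip
import zlib

MAX_RETRIES = 3  # hypothetical value, not taken from this PR


def parse_gzipped_fasta(data: bytes) -> int:
    """Decompress fully and count FASTA records.

    gzip.decompress verifies the CRC32 stored in the gzip trailer, so a
    truncated or corrupted download raises instead of parsing silently.
    """
    text = gzip.decompress(data)
    return sum(1 for line in text.splitlines() if line.startswith(b">"))


async def download_and_parse(url, fetch, max_retries=MAX_RETRIES):
    """Download everything, then parse; both steps sit inside the retry loop."""
    last_error = None
    for _attempt in range(max_retries):
        try:
            data = await fetch(url)           # whole file first (no streaming parse)
            return parse_gzipped_fasta(data)  # CRC32 failure here triggers a retry
        except (OSError, EOFError, zlib.error) as exc:
            last_error = exc
    raise RuntimeError(f"{url}: failed after {max_retries} attempts: {last_error}")


async def download_all(urls, fetch, n_simultaneous=10):
    """Run downloads concurrently, clamped to between 1 and 30 at a time."""
    semaphore = asyncio.Semaphore(max(1, min(30, n_simultaneous)))

    async def bounded(url):
        async with semaphore:
            return await download_and_parse(url, fetch)

    # Failures come back as exception objects so one bad URL does not abort the rest.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Here `fetch` is any coroutine that takes a URL and returns the full response body as bytes, so the sketch can be exercised without network access by passing in a stub.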

Notes:

final fixes/clean up:

  • DNA and protein downloads in gbsketch are now handled separately, which means that batching is no longer by accession but by download link, as in urlsketch. The batch tests were failing because they expected zips to batch by accession; they have been modified to account for the changed behavior.
  • fix occasional bug in batch restart: it sometimes wrote an empty zipfile first. This was caused by not needing to write the initial sigs because they were already present in the existing batch. Fixed by returning info on whether sigs were actually written by async_write_sigs_to_zip.
  • enable *.sig.zip batched outputs + restart. Batches will be: filename.N.sig.zip
  • modify inputs to allow up to 30 simultaneous downloads and eliminate need for ncbi api key. Keep/strengthen memory limitation note in docs.
  • use .incomplete for in-progress zipfiles, rename when finished
  • add --verbose option to report for every accession
  • final review / clean up
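The batch-output items above (batch naming, `.incomplete` files renamed on completion, and reporting whether anything was actually written) can be illustrated with a small Python sketch. It is a hedged approximation and not the plugin's implementation: `batch_name` and `write_batch` are hypothetical helpers, and the handling of plain `.zip` names is an assumption made for the example.

```python
import os
import zipfile


def batch_name(base: str, index: int) -> str:
    """Insert the batch index before the extension.

    'filename.sig.zip' with index 2 becomes 'filename.2.sig.zip'.
    The plain '.zip' case is assumed to follow the same pattern.
    """
    suffix = ".sig.zip" if base.endswith(".sig.zip") else ".zip"
    stem = base[: -len(suffix)]
    return f"{stem}.{index}{suffix}"


def write_batch(path: str, sigs: dict) -> bool:
    """Write sigs to `path` through a '.incomplete' file.

    Returns False and writes nothing when `sigs` is empty, so the caller
    knows no zip was produced and never leaves an empty zipfile behind.
    """
    if not sigs:
        return False
    tmp = path + ".incomplete"
    with zipfile.ZipFile(tmp, "w") as zf:
        for name, payload in sigs.items():
            zf.writestr(name, payload)
    # Rename only after the zip is fully written and closed, so an
    # interrupted run leaves a '.incomplete' file rather than a partial zip.
    os.replace(tmp, path)
    return True
```

With this shape, a restart can treat any leftover `.incomplete` file as unfinished work, and the boolean return value lets the caller advance the batch index only when a zip was really written.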

Future: if getting the dehydrated file takes a while for large databases, we may want to consider writing an intermediate file with the download links, so that the fetch links do not need to be retrieved twice. See #228.

@bluegenes bluegenes changed the title WIP: use dehydrate approach MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025
@bluegenes bluegenes changed the title MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach WIP: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025
@bluegenes bluegenes changed the title WIP: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025
@bluegenes

@ctb ready for review! Should make gbsketch usable again with the changes to NCBI REST API.


@ctb ctb left a comment

wow, that's a lot!

@bluegenes bluegenes merged commit 2281af3 into main Apr 2, 2025
1 check passed
@bluegenes bluegenes deleted the use-dehydrate-approach branch April 2, 2025 15:17
@bluegenes bluegenes mentioned this pull request Apr 2, 2025