MRG: fix `gbsketch` NCBI downloads by using dehydrate-rehydrate approach #222

bluegenes · 2025-03-10T23:38:01Z

Fixes:

NCBI may block IP address due to gbsketch #215
ideas for improving retrieval through NCBI REST API #216
suggestion: name output .zip files from batches .sig.zip instead of .zip #191
more output desired #174
gbsketch now uses the dehydrate -> rehydrate approach. It first uses the NCBI API to download a "dehydrated" file containing direct fetch links for all accessions. It then downloads the gzipped FASTA files from those links, checking along the way that each download completes properly.
Since it is not possible to get md5sums for the gzipped files from the dehydrated file, gbsketch (and only gbsketch) now relies on the internal checksums (crc32) in the gz files, which needletail checks automatically as it processes each file. To make sure we retry if something fails, I've moved the file parsing inside the retry logic: if needletail encounters an error with the file, we retry up to the maximum number of retries and log the final error if it still does not succeed. Unfortunately, since needletail is sync, we still need to download all data for a given file and THEN parse it to write/sketch (i.e. we cannot parse while streaming).
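To illustrate the "parse inside the retry loop" idea, here is a minimal Python sketch (the plugin itself is not written in Python and uses needletail, so this is only an analogy). The function name `fetch_and_parse_with_retry` and the `fetch` callable are hypothetical; Python's `gzip` module stands in for the parser, since it also verifies the CRC32 trailer once the stream is read to the end.

```python
import gzip
import io
import zlib


def fetch_and_parse_with_retry(fetch, max_retries=3):
    """Download gzipped bytes via fetch() and decompress them fully.

    `fetch` is a hypothetical zero-argument callable returning bytes.
    Decompression happens inside the retry loop, so a corrupt payload
    triggers another download attempt instead of an immediate failure.
    """
    last_error = None
    for _attempt in range(max_retries):
        try:
            payload = fetch()
            # Reading to EOF makes gzip verify the CRC32/length trailer.
            with gzip.GzipFile(fileobj=io.BytesIO(payload)) as handle:
                return handle.read()
        except (OSError, EOFError, zlib.error) as err:
            # BadGzipFile (CRC mismatch) is an OSError; truncation is EOFError.
            last_error = err
    raise RuntimeError(f"failed after {max_retries} attempts: {last_error}")
```

Because the whole payload is in memory before decompression starts, this mirrors the download-then-parse limitation described above rather than a streaming parse.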
Since we are using direct download links, we can allow many more simultaneous downloads. I've enabled a range of 1-30, following the maximum used by the NCBI datasets application, and set the new default for both gbsketch and urlsketch to 10, since the previous limit of 3 was not actually necessary.
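A bounded number of simultaneous downloads can be sketched with a semaphore. This is an illustration only, not the plugin's code: `download_all` and `fetch_one` are hypothetical names, and only the 1-30 range and the default of 10 come from the description above.

```python
import asyncio

MAX_SIMULTANEOUS = 30      # upper bound stated in this PR
DEFAULT_SIMULTANEOUS = 10  # new default stated in this PR


async def download_all(urls, fetch_one, n_simultaneous=DEFAULT_SIMULTANEOUS):
    """Run fetch_one(url) for every URL, at most n_simultaneous at a time.

    `fetch_one` is a hypothetical coroutine function taking a URL.
    Results are returned in the same order as `urls`.
    """
    if not 1 <= n_simultaneous <= MAX_SIMULTANEOUS:
        raise ValueError(f"n_simultaneous must be 1-{MAX_SIMULTANEOUS}")
    semaphore = asyncio.Semaphore(n_simultaneous)

    async def bounded(url):
        # Only n_simultaneous coroutines can hold the semaphore at once.
        async with semaphore:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```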
For missing protein files (files for which NCBI has no fetch link because they do not exist), we now write empty URLs in the failures file. To keep allowing urlsketch to run from a gbsketch failures file, I've added --force to urlsketch to skip past entries with empty URLs. However, since we can re-run gbsketch from the gbsketch failures file, that is the preferred/recommended approach.
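The skip-on-force behavior can be pictured roughly as follows. This is a Python sketch under stated assumptions, not the actual urlsketch code: the function name and the `accession` and `url` column names are hypothetical, and the real failures file may use different columns.

```python
import csv
import io


def load_download_rows(csv_text, force=False):
    """Return rows with a usable URL from CSV text.

    Assumes hypothetical 'accession' and 'url' columns. Rows with an
    empty URL raise an error unless force=True, in which case they
    are skipped.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not (row.get("url") or "").strip():
            if force:
                continue  # skip entries that have no fetch link
            raise ValueError(
                f"empty URL for {row.get('accession')}; use force=True to skip"
            )
        rows.append(row)
    return rows
```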

Notes:

There is an md5sum.txt downloaded with the dehydrated zipfile, but its md5sums are for the non-gz files, whereas the md5checksums.txt on the FTP site covers the gzipped files (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/md5checksums.txt).

final fixes/clean up:

DNA and protein downloads in gbsketch are now handled separately, which means that batching is no longer by accession but by download link, as in urlsketch. The batch tests were failing because they expected zips to batch by accession; they have been modified to account for the changed behavior.
fix occasional bug in batch restart: it sometimes wrote an empty zipfile first. This happened when the initial sigs did not need to be written because they were already present in an existing batch. Fixed by having async_write_sigs_to_zip return whether sigs were actually written.
enable *.sig.zip batched outputs + restart. Batches will be: filename.N.sig.zip
modify inputs to allow up to 30 simultaneous downloads and eliminate need for ncbi api key. Keep/strengthen memory limitation note in docs.
use .incomplete for in-progress zipfiles, rename when finished
add --verbose option to report for every accession
final review / clean up
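
The batch naming and the .incomplete handling from the list above can be sketched as follows. This is an illustrative Python sketch, not the plugin's implementation: `batch_name` and `write_batch` are hypothetical helpers, and the `.zip` branch and the exact placement of the `.incomplete` suffix are assumptions (the PR only states `filename.N.sig.zip` and "rename when finished").

```python
import os
import tempfile
import zipfile


def batch_name(output, index):
    """Map 'x.sig.zip' to 'x.N.sig.zip' (the '.zip' case is an assumption)."""
    for suffix in (".sig.zip", ".zip"):
        if output.endswith(suffix):
            return f"{output[:-len(suffix)]}.{index}{suffix}"
    raise ValueError(f"unexpected output name: {output}")


def write_batch(path, members):
    """Write a zip under a temporary '.incomplete' name, then rename it."""
    incomplete = path + ".incomplete"
    with zipfile.ZipFile(incomplete, "w") as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    # The final name only appears once the archive is fully written.
    os.replace(incomplete, path)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, batch_name("out.sig.zip", 1))
        print(write_batch(target, {"example.sig": b"{}"}))
```

With this pattern, a restart can treat any leftover `.incomplete` file as an interrupted batch instead of mistaking it for a finished one.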

Future: if getting the dehydrated file takes a while for large databases, we may want to consider writing an intermediate file with the download links, to avoid having to get the fetch links twice. See #228.

bluegenes · 2025-04-01T22:42:21Z

@ctb ready for review! This should make gbsketch usable again given the changes to the NCBI REST API.

ctb

wow, that's a lot!

init

988bace

bluegenes mentioned this pull request Mar 10, 2025

ideas for improving retrieval through NCBI REST API #216

Closed

bluegenes added 11 commits March 31, 2025 13:57

in progress...

5c249e3

upd download_and_process_with_retry

88dc17f

closer...

c5d7656

properly handle failure to parse dehydrated zip

6bb6e59

add test for empty url

8d39122

fix bug where we wrote an empty batch if restarting from prior batch

8aab112

enable sig.zip batching

409d251

fix formatting

bdc83a7

clean up, clippy fixes, etc

1a40a24

up n simultaneous downloads

4ddecbe

upd doc; remove debugging prints

8386c2d

bluegenes mentioned this pull request Apr 1, 2025

suggestion: name output .zip files from batches .sig.zip instead of .zip #191

Closed

bluegenes changed the title ~~WIP: use dehydrate approach~~ MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025

bluegenes changed the title ~~MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach~~ WIP: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025

bluegenes added 4 commits April 1, 2025 14:38

use .incomplete for in-progress zipfiles

54bd801

add --verbose to print progress for every download

3848612

more cleanup

6c344f9

add test to make sure we can run gbsketch from a gbsketch failures file

f6375d9

bluegenes mentioned this pull request Apr 1, 2025

gbsketch: write intermediate file with fetch links? #228

Closed

bluegenes changed the title ~~WIP: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach~~ MRG: fix gbsketch NCBI downloads by using dehydrate-rehydrate approach Apr 1, 2025

ctb approved these changes Apr 2, 2025

bluegenes merged commit 2281af3 into main Apr 2, 2025
1 check passed

bluegenes deleted the use-dehydrate-approach branch April 2, 2025 15:17

bluegenes mentioned this pull request Apr 2, 2025

more output desired #174

Closed
