hunterckx
Collaborator

@hunterckx hunterckx commented Aug 28, 2025

Description

  • Refactored TypeScript catalog build code for BRC to move some actions to functions that can be shared with GA2
  • Updated names of some NPM scripts to clearly distinguish BRC scripts from GA2
  • Added source YAML, Python build step, TypeScript build step, and build output for GA2 catalog

Additional notes:

  • Due to apparent Python import restrictions, the catalog files are located under catalog/ga2 rather than a separate top-level folder such as catalog_ga2
  • The GA2 catalog has its own requirements.txt, although it's currently a copy of the BRC one
  • I split build-files-from-ncbi.py from the GA2 repository into separate update_assemblies.py and build_files_from_ncbi.py scripts, but omitted the fetch_object_storage_file_list function since, as far as I can tell, it is not currently used for anything
  • The GA2 repository does not contain the SRA metadata TSV, but I've checked it in here
  • The types for GA2 entities are based on a combination of those from PR #768 ("feat: set up ga2 site config (#751)"), those from the GA2 repository, and adjustments that seemed important to me
  • The two utils modules under app/apis/catalog are also copied from PR #768, in order to implement duplicate ID checks
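To illustrate the duplicate ID checks mentioned in the last note, here is a minimal Python sketch of the general idea. It is not the code from this PR: the actual utils modules are TypeScript, and the function names `find_duplicate_ids` and `verify_unique_ids`, the `accession` key, and the sample accessions are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def find_duplicate_ids(entities: Iterable[T], get_id: Callable[[T], str]) -> List[str]:
    """Return every ID that occurs more than once, in first-seen order."""
    # Counter is a dict subclass, so it keeps the order in which IDs first appear.
    counts = Counter(get_id(entity) for entity in entities)
    return [entity_id for entity_id, count in counts.items() if count > 1]


def verify_unique_ids(label: str, entities: Iterable[T], get_id: Callable[[T], str]) -> None:
    """Raise ValueError if any ID is shared by two or more entities."""
    duplicates = find_duplicate_ids(entities, get_id)
    if duplicates:
        raise ValueError(f"Duplicate {label} IDs found: {', '.join(duplicates)}")


# Hypothetical sample data; these accessions are made up for illustration.
sample_assemblies = [
    {"accession": "GCA_000000001.1"},
    {"accession": "GCA_000000002.1"},
    {"accession": "GCA_000000001.1"},
]
```

Calling `verify_unique_ids("assembly", sample_assemblies, lambda a: a["accession"])` on the sample above raises a `ValueError` naming the repeated accession, which is the kind of failure a build step would want to surface before writing output.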

Related Issue

Closes #752

@hunterckx hunterckx changed the title from "Hunter/752 add ga2 catalog" to "feat: add ga2 catalog data and scripts (#752)" Aug 28, 2025
feat: first version of an organism.yml file, we default the ploidy to DIPLOID since that is the most common for vertebrates
@github-actions github-actions bot added the feat label Aug 28, 2025
@hunterckx hunterckx marked this pull request as ready for review August 29, 2025 04:25
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request adds comprehensive support for the GA2 (Genome Ark 2) catalog by refactoring the existing BRC catalog build infrastructure to support shared functionality and adding complete GA2-specific implementations.

Key changes:

  • Refactored TypeScript catalog build code to extract shared functions that can be used by both BRC and GA2
  • Added GA2 catalog source data, build scripts (Python and TypeScript), and schema validation
  • Updated NPM scripts to distinguish between BRC and GA2 operations

Reviewed Changes

Copilot reviewed 23 out of 28 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| package.json | Updated NPM scripts to differentiate BRC from GA2 operations and added new GA2-specific build commands |
| catalog/ga2/source/organisms.yml | Added organism source data with taxonomy IDs and ploidy information for GA2 catalog |
| catalog/ga2/source/assemblies.yml | Added assembly accession data for GA2 catalog with species comments |
| catalog/ga2/schema/scripts/ | Added validation scripts for GA2 catalog schema |
| catalog/ga2/build/ts/ | Added TypeScript build configuration, constants, and main build script for GA2 |
| catalog/ga2/build/py/ | Added Python build scripts and requirements for GA2 catalog generation |
| catalog/build/ts/utils.ts | Refactored to extract shared utility functions for organism processing and validation |
| catalog/build/ts/constants.ts | Refactored to separate core and BRC-specific source genome keys |
| catalog/build/ts/build-*.ts | Updated to use shared utility functions |
| app/apis/catalog/ga2/utils.ts | Added utility functions for GA2 entity ID and title generation |


ASSEMBLIES_PATH = "catalog/ga2/source/assemblies.yml"

UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/BRC/assemblyList.json"

Copilot AI Aug 29, 2025

The URL points to BRC assemblies but this is for GA2. The URL should point to the GA2-specific assembly list.

Suggested change
- UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/BRC/assemblyList.json"
+ UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/ga2/assemblyList.json"

Collaborator Author

The suggested URL doesn't actually exist. The UCSC URL used here was already in the GA2 repo and gives us a few UCSC links, but I don't think we've ever explicitly addressed whether it makes sense to have it here.

Collaborator

@d-callan d-callan left a comment

I didn't look at the GA2 dirs at all, but the BRC changes seem reasonable to me.

@hunterckx
Collaborator Author

Thanks @d-callan! @Smeds, please take a look at the GA2 part when you have time.

@hunterckx
Collaborator Author

One more note @Smeds -- I replaced the few infraspecific taxonomy IDs in your organisms.yml with the corresponding species-level taxonomy IDs, since our concept of "organisms" corresponds to species and the IDs in organisms.yml are required to be species-level.
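For reference, the kind of normalization described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how the replacement was actually done: the `TaxonNode` type, the `to_species_level` function, the in-memory `nodes` table, and all taxonomy IDs in it are hypothetical, and a real build would look up lineage from NCBI.

```python
from typing import Dict, NamedTuple


class TaxonNode(NamedTuple):
    parent_id: int  # taxonomy ID of the parent node
    rank: str       # e.g. "subspecies", "species", "genus"


def to_species_level(tax_id: int, nodes: Dict[int, TaxonNode]) -> int:
    """Walk up the lineage from tax_id and return the first species-rank ID."""
    current = tax_id
    seen = set()  # guards against a root node that is its own parent
    while current in nodes and current not in seen:
        seen.add(current)
        node = nodes[current]
        if node.rank == "species":
            return current
        current = node.parent_id
    raise ValueError(f"No species-level ancestor found for taxonomy ID {tax_id}")


# Hypothetical lineage table; the IDs below are made up, not real NCBI IDs.
nodes = {
    1: TaxonNode(parent_id=1, rank="no rank"),        # root
    1000: TaxonNode(parent_id=1, rank="genus"),
    1001: TaxonNode(parent_id=1000, rank="species"),
    1002: TaxonNode(parent_id=1001, rank="subspecies"),
}
```

With this table, `to_species_level(1002, nodes)` resolves the infraspecific ID to its species-level parent, an ID that is already species-level is returned unchanged, and an ID above species level (such as the genus) raises a `ValueError` rather than being silently accepted.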

@NoopDog NoopDog merged commit 06b0851 into main Sep 1, 2025
3 checks passed
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[GA2] Add a catalog-ga2 folder and migrate the GA2 build into it
4 participants