feat: add ga2 catalog data and scripts (#752) #772
feat: first version of an organism.yml file, we default the ploidy to DIPLOID since that is the most common for vertebrates
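For illustration, the ploidy default described in this commit could be applied along the following lines. This is only a sketch: the `ploidy` and `taxonomy_id` field names, the string representation of the ploidy value, and the helper name `with_default_ploidy` are assumptions, not the actual schema or code in this PR.

```python
# Hypothetical default; the commit message says DIPLOID is the most common
# ploidy for vertebrates.
DEFAULT_PLOIDY = "DIPLOID"


def with_default_ploidy(organisms):
    """Return copies of the organism entries, filling in a missing ploidy."""
    result = []
    for organism in organisms:
        entry = dict(organism)  # copy so the input entries are not mutated
        if not entry.get("ploidy"):
            entry["ploidy"] = DEFAULT_PLOIDY
        result.append(entry)
    return result


# Made-up entries standing in for parsed organisms.yml content.
organisms = [
    {"taxonomy_id": 1000},
    {"taxonomy_id": 2000, "ploidy": "HAPLOID"},
]
filled = with_default_ploidy(organisms)
```

Entries that already state a ploidy are left unchanged, so the default only covers the gaps.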
Pull Request Overview
This pull request adds comprehensive support for the GA2 (Genome Ark 2) catalog by refactoring the existing BRC catalog build infrastructure to support shared functionality and adding complete GA2-specific implementations.
Key changes:
- Refactored TypeScript catalog build code to extract shared functions that can be used by both BRC and GA2
- Added GA2 catalog source data, build scripts (Python and TypeScript), and schema validation
- Updated NPM scripts to distinguish between BRC and GA2 operations
Reviewed Changes
Copilot reviewed 23 out of 28 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
package.json | Updated NPM scripts to differentiate BRC from GA2 operations and added new GA2-specific build commands |
catalog/ga2/source/organisms.yml | Added organism source data with taxonomy IDs and ploidy information for GA2 catalog |
catalog/ga2/source/assemblies.yml | Added assembly accession data for GA2 catalog with species comments |
catalog/ga2/schema/scripts/ | Added validation scripts for GA2 catalog schema |
catalog/ga2/build/ts/ | Added TypeScript build configuration, constants, and main build script for GA2 |
catalog/ga2/build/py/ | Added Python build scripts and requirements for GA2 catalog generation |
catalog/build/ts/utils.ts | Refactored to extract shared utility functions for organism processing and validation |
catalog/build/ts/constants.ts | Refactored to separate core and BRC-specific source genome keys |
catalog/build/ts/build-*.ts | Updated to use shared utility functions |
app/apis/catalog/ga2/utils.ts | Added utility functions for GA2 entity ID and title generation |
ASSEMBLIES_PATH = "catalog/ga2/source/assemblies.yml"

UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/BRC/assemblyList.json"
The URL points to BRC assemblies but this is for GA2. The URL should point to the GA2-specific assembly list.
Suggested change:
- UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/BRC/assemblyList.json"
+ UCSC_ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/ga2/assemblyList.json"
Obviously the suggested URL doesn't actually exist. The existing UCSC URL was already in the GA2 repo and gives us a few UCSC links, but I don't think we've ever explicitly addressed whether it makes sense to have it here.
I didn't look at the ga2 dirs at all, but the BRC changes seem reasonable to me.
One more note @Smeds -- I replaced the few infraspecific taxonomy IDs in your organisms.yml with the corresponding species-level taxonomy IDs, since our concept of "organisms" corresponds to species and the IDs in organisms.yml are required to be species-level.
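As a rough sketch of the replacement described above, an infraspecific taxonomy ID can be mapped to its species-level ancestor by walking up the lineage. The lookup tables `rank_by_id` and `parent_by_id`, the function name, and the IDs below are all made up for illustration; in practice the ranks and parents would come from NCBI Taxonomy, and this is not the code used in this PR.

```python
def to_species_level(tax_id, rank_by_id, parent_by_id):
    """Return the nearest ancestor (or the ID itself) whose rank is 'species'."""
    current = tax_id
    seen = set()  # guards against cycles in malformed lookup data
    while current is not None and current not in seen:
        if rank_by_id.get(current) == "species":
            return current
        seen.add(current)
        current = parent_by_id.get(current)
    raise ValueError(f"no species-level ancestor found for taxonomy ID {tax_id}")


# Made-up lineage: 1001 is a subspecies of species 1000, which sits in genus 900.
rank_by_id = {1001: "subspecies", 1000: "species", 900: "genus"}
parent_by_id = {1001: 1000, 1000: 900}

species_id = to_species_level(1001, rank_by_id, parent_by_id)
```

An ID that is already species-level is returned unchanged, so the same function can be run over every entry in organisms.yml.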
Description
Additional notes:
- Placed the GA2 catalog under `catalog/ga2` rather than a separate top-level folder such as `catalog_ga2`
- Split `build-files-from-ncbi.py` from the GA2 repository into separate `update_assemblies.py` and `build_files_from_ncbi.py`, but omitted the `fetch_object_storage_file_list` function, since from what I can tell it's not ultimately used for anything as it is
- The `utils` modules under `app/apis/catalog` are also copied from feat: set up ga2 site config (#751) #768, in order to implement duplicate ID checks

Related Issue
Closes #752
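
For readers unfamiliar with the duplicate ID checks mentioned in the additional notes, the idea can be sketched as follows. The PR implements these checks in its own build code; the Python function below, its name, the `accession` field, and the sample accessions are hypothetical and only illustrate the technique.

```python
from collections import Counter


def find_duplicate_ids(entities, get_id):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(get_id(entity) for entity in entities)
    return sorted(entity_id for entity_id, n in counts.items() if n > 1)


# Made-up assembly entries; real accessions would come from assemblies.yml.
assemblies = [
    {"accession": "GCA_000000001.1"},
    {"accession": "GCA_000000002.1"},
    {"accession": "GCA_000000001.1"},
]
duplicates = find_duplicate_ids(assemblies, lambda a: a["accession"])
```

A build script could fail when the returned list is non-empty, reporting the offending IDs rather than silently keeping one of the duplicates.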