Commit d1d917b

feat: build tsv incorporating assembly list json (#5)
1 parent 713725f commit d1d917b

7 files changed: +851 -787 lines changed

data-catalog/.eslintignore

Lines changed: 2 additions & 0 deletions
@@ -2,3 +2,5 @@
 **/.next/*
 
 /out
+
+venv

data-catalog/.gitignore

Lines changed: 4 additions & 1 deletion
@@ -30,4 +30,7 @@ npm-debug.log*
 /public/favicons/*
 
 # Build Dir
-/out
+/out
+
+# python
+venv

data-catalog/.prettierignore

Lines changed: 3 additions & 0 deletions
@@ -6,3 +6,6 @@ node_modules
 
 # build
 /out
+
+# python
+venv

data-catalog/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# BRC Analytics Data Catalog
+
+## Building the data source files
+
+In the `data-catalog` directory, create a Python virtual environment and install requirements:
+
+```shell
+python3 -m venv ./venv
+source ./venv/bin/activate
+pip install -r ./files/requirements.txt
+```
+
+Then run the script:
+
+```shell
+python3 ./files/build-genomes-files.py
+```
+
+The environment can be deactivated by running `deactivate`, and re-activated by running `source ./venv/bin/activate` again.

data-catalog/files/build-genomes-files.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+import pandas as pd
+import re
+import requests
+
+GENOMES_SOURCE_URL = "https://docs.google.com/spreadsheets/d/1NRfTvebPl6zJ0l9tCqBtq6YCrwV6_XDBlheq3L5HcvQ/gviz/tq?tqx=out:csv&sheet=GenomeDataTypes_Summary.csv"
+ASSEMBLIES_URL = "https://hgdownload.soe.ucsc.edu/hubs/BRC/assembly.list.json"
+
+OUTPUT_PATH = "files/source/genomes.tsv"
+
+def build_genomes_files():
+    print("Building files")
+
+    genomes_source_df = pd.read_csv(GENOMES_SOURCE_URL, keep_default_na=False, usecols=lambda name: re.fullmatch(r"Unnamed: \d+", name) is None)
+    assemblies_df = pd.DataFrame(requests.get(ASSEMBLIES_URL).json()["data"])
+
+    gen_bank_merge_df = genomes_source_df.merge(assemblies_df, how="left", left_on="Genome Version/Assembly ID", right_on="genBank")
+    ref_seq_merge_df = genomes_source_df.merge(assemblies_df, how="left", left_on="Genome Version/Assembly ID", right_on="refSeq")
+
+    result_df = gen_bank_merge_df.combine_first(ref_seq_merge_df)
+
+    result_df.to_csv(OUTPUT_PATH, index=False, sep="\t")
+
+    print(f"Wrote to {OUTPUT_PATH}")
+
+if __name__ == "__main__":
+    build_genomes_files()
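
Note on the merge logic: the script matches each spreadsheet row against the assembly list twice, once on `genBank` and once on `refSeq`, then uses `combine_first` to take the GenBank match where present and fall back to the RefSeq match otherwise. The following is a minimal offline sketch of that behavior, not the committed code. The CSV text, the accession values, and the `comName` field are hypothetical stand-ins for the real spreadsheet and `assembly.list.json` contents; only the `Genome Version/Assembly ID`, `genBank`, and `refSeq` names come from the script.

```python
import io
import re

import pandas as pd

# Hypothetical stand-in for the spreadsheet CSV export. The trailing commas
# leave empty header cells, which pandas names "Unnamed: 2" and "Unnamed: 3".
SOURCE_CSV = (
    "Species,Genome Version/Assembly ID,,\n"
    "Species A,GCA_000000001.1,,\n"
    "Species B,GCF_000000002.1,,\n"
    "Species C,GCA_000000009.9,,\n"
)

# Hypothetical stand-in for the "data" array of the assembly list JSON.
ASSEMBLIES = [
    {"genBank": "GCA_000000001.1", "refSeq": "GCF_000000001.1", "comName": "a"},
    {"genBank": "GCA_000000002.1", "refSeq": "GCF_000000002.1", "comName": "b"},
]

ID_COLUMN = "Genome Version/Assembly ID"


def build_demo_frames():
    # Same column filter as the script: skip pandas' auto-named empty columns.
    source_df = pd.read_csv(
        io.StringIO(SOURCE_CSV),
        keep_default_na=False,
        usecols=lambda name: re.fullmatch(r"Unnamed: \d+", name) is None,
    )
    assemblies_df = pd.DataFrame(ASSEMBLIES)

    # Two left merges keep every source row, in source order.
    by_gen_bank = source_df.merge(
        assemblies_df, how="left", left_on=ID_COLUMN, right_on="genBank"
    )
    by_ref_seq = source_df.merge(
        assemblies_df, how="left", left_on=ID_COLUMN, right_on="refSeq"
    )

    # Cells left empty by the GenBank match are filled from the RefSeq match.
    return source_df, by_gen_bank.combine_first(by_ref_seq)


source_df, result_df = build_demo_frames()
print(result_df[[ID_COLUMN, "genBank", "refSeq", "comName"]])
```

In this sketch, Species A is matched through its GenBank accession, Species B through its RefSeq accession, and Species C keeps empty assembly columns because neither merge finds it. One assumption worth checking against the real data: `combine_first` aligns the two merge results by row index, so the fallback only lines up if each ID matches at most one assembly in each merge. A duplicated `genBank` or `refSeq` value would add rows to one result and shift the alignment.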

data-catalog/files/requirements.txt

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+certifi==2024.8.30
+charset-normalizer==3.3.2
+idna==3.8
+numpy==2.1.0
+pandas==2.2.2
+python-dateutil==2.9.0.post0
+pytz==2024.1
+requests==2.32.3
+six==1.16.0
+tzdata==2024.1
+urllib3==2.2.2

data-catalog/files/source/genomes.tsv

Lines changed: 786 additions & 786 deletions
Large diffs are not rendered by default.
