Kmerseek demo example for YC S25 Application
Create a conda environment:
conda create -n 2025-kmerseek-yc-demo -c conda-forge -c bioconda --yes polars sourmash click pandas sourmash_plugin_branchwater
Activate the environment:
conda activate 2025-kmerseek-yc-demo
Then install the kmerseek branch:
pip install git+https://github.com/seanome/kmerseek.git@olgabot/initial-demo
Create an index of a subset of human cell death genes with `make index`, or run:
kmerseek \
index \
--ksize 16 \
bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz
`kmerseek index` output
kmerseek \
index \
--ksize 16 \
bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz
Loaded 1 rows in total (0 genome and 1 protein files)
Building 1 sketch types:
hp,k=16,scaled=5,num=0,abund=true
Starting file 1/1 (100%)
Writing manifest
DONE. Processed 1 fasta files
loaded 25 signatures total, from 1 files
loaded and merged 25 signatures
merged signature has the following properties:
k=16 molecule=hp num=0 scaled=5 seed=42
total hashes in merged signature: 1603
now processing sequence files for matches!
opening sequence file 'bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz'
DONE.
searched 25 sequences from 1 files, containing a total of 9.3 kbp.
matched and saved a total of 25 sequences with 9.3 kbp.
matched and saved a total of 1760 k-mers.
found 1603 distinct matching hashes (100.0%)
indexing all sketches in 'bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip'
Loading sketches from bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip.siglist
Reading analysis(s) from: 'bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip.siglist'
FAILED to load as JSON files; falling back to general recursive
Loaded 25 analysis signature(s)
Found 25 sketches total.
Indexing 25 sketches.
Using external storage - not copying sketches.
...index is done! results in 'bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip.rocksdb'
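The `scaled=5` in the log above is a sourmash sketching parameter: roughly one in five k-mer hashes is kept. The sketch below illustrates that idea on a toy hydrophobic-polar string. It is a minimal illustration under stated assumptions, not kmerseek's or sourmash's code: `stand_in_hash` uses BLAKE2b from the standard library where sourmash uses MurmurHash3, so the retained hashes will differ from real signatures, and the helper names and the toy sequence are hypothetical.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, ksize):
    """Return every overlapping window of length ksize."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def stand_in_hash(kmer):
    """64-bit hash of a k-mer (assumption: BLAKE2b stands in for MurmurHash3)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, ksize, scaled):
    """Keep only hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(stand_in_hash, kmers(seq, ksize)) if h < threshold}


# Toy 280-letter hp string (made up for illustration, not a real protein)
rng = random.Random(0)
toy = "".join(rng.choice("hp") for _ in range(280))

all_kmers = kmers(toy, 16)
sketch = scaled_sketch(toy, 16, scaled=5)
print(f"{len(all_kmers)} k-mers, {len(sketch)} hashes kept at scaled=5")
```

A sequence of length L yields L - k + 1 overlapping k-mers, so the 280-letter toy string has 265 windows at k=16, and a scaled value of 5 should keep about a fifth of the distinct hashes.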
Now search for the *C. elegans* ced-9 gene among the human genes with `make search`, or run:
kmerseek \
search \
--ksize 16 \
ced9.fasta \
bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz
`kmerseek search` output
kmerseek \
search \
--ksize 16 \
ced9.fasta \
bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz
Loaded 1 rows in total (0 genome and 1 protein files)
Building 1 sketch types:
hp,k=16,scaled=5,num=0,abund=true
Starting file 1/1 (100%)
Writing manifest
DONE. Processed 1 fasta files
2025-03-04 15:09:01 - kmerseek.logging - INFO - Calling get_kmers_cli on ced9.fasta.hp.k16.scaled5.sig.zip with ced9.fasta
2025-03-04 15:09:01 - kmerseek.logging - INFO - Saving matches to /var/folders/rl/81r400y52z38l8_kwn4g1xdc0000gn/T/tmp7msljbul.csv
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures
merged signature has the following properties:
k=16 molecule=hp num=0 scaled=5 seed=42
total hashes in merged signature: 49
now processing sequence files for matches!
opening sequence file 'ced9.fasta'
DONE.
searched 1 sequences from 1 files, containing a total of 280 bp.
matched and saved a total of 1 sequences with 280 bp.
matched and saved a total of 49 k-mers.
found 49 distinct matching hashes (100.0%)
2025-03-04 15:09:01 - kmerseek.logging - INFO - Reading in k-mers, adding hp encoded values
2025-03-04 15:09:01 - kmerseek.logging - INFO - Found signature file bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip, skipping! Re-make with '--force'
selection scaled: Some(5)
Reading query(s) from: 'ced9.fasta.hp.k16.scaled5.sig.zip.siglist'
FAILED to load as JSON files; falling back to general recursive
Loaded 1 query signature(s)
Reading search(s) from: 'bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip.siglist'
FAILED to load as JSON files; falling back to general recursive
Loaded 25 search signature(s)
DONE. Processed 25 search sigs
2025-03-04 15:09:01 - kmerseek.logging - INFO - Found k-mer parquet bcl2_first25_uniprotkb_accession_O43236_OR_accession_2025_02_06.fasta.gz.hp.k16.scaled5.sig.zip.kmers.pq, skipping! Re-make with '--force'
---
Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q9UK96|FBX10_HUMAN F-box only protein 10 OS=Homo sapiens OX=9606 GN=FBXO10 PE=1 SV=3
query: MSIGESIDGKINDWEEPGIVGVVVCGRMMFSLK (59-92)
alpha: hphhpphphphpphpphhhhhhhhphphhhphp
match: PNWPNQPDVEPESWREAAGIYILYHGNPVVSGN (57-90)
---
Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q12982|BNIP2_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 2 OS=Homo sapiens OX=9606 GN=BNIP2 PE=1 SV=1
query: RLDIEGFVVDYFTHRILFVYTSLFIKTRIRNN (76-108)
alpha: phphphhhhphhppphhhhhpphhhppphppp
match: SIEADILAITGPEDQPLLAVTRPFISSKFSQK (23-55)
---
Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q9BXH1|BBC3_HUMAN Bcl-2-binding component 3, isoforms 1/2 OS=Homo sapiens OX=9606 GN=BBC3 PE=1 SV=1
query: LIGLISFGGFVAAKMME (170-187)
alpha: hhhhhphhhhhhhphhp
match: APAAPTLLPAAYLCAPT (46-63)
---
Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q13625|ASPP2_HUMAN Apoptosis-stimulating of p53 protein 2 OS=Homo sapiens OX=9606 GN=TP53BP2 PE=1 SV=2
query: KVGRRKQNRRWSMIGA (241-257)
alpha: phhppppppphphhhh
match: TIIHREDEDEIEWWWA (1084-1100)
---
Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q16611|BAK_HUMAN Bcl-2 homologous antagonist/killer OS=Homo sapiens OX=9606 GN=BAK1 PE=1 SV=1
query: RKQNRRWSMIGAGVTA (245-261)
alpha: pppppphphhhhhhph
match: HQQEQEAEGVAAPADP (42-58)
match_name,query_name,query_start,query_end,query,match_start,match_end,match,encoded,length
sp|Q9UK96|FBX10_HUMAN F-box only protein 10 OS=Homo sapiens OX=9606 GN=FBXO10 PE=1 SV=3,sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1,59,92,MSIGESIDGKINDWEEPGIVGVVVCGRMMFSLK,57,90,PNWPNQPDVEPESWREAAGIYILYHGNPVVSGN,hphhpphphphpphpphhhhhhhhphphhhphp,33
sp|Q12982|BNIP2_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 2 OS=Homo sapiens OX=9606 GN=BNIP2 PE=1 SV=1,sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1,76,108,RLDIEGFVVDYFTHRILFVYTSLFIKTRIRNN,23,55,SIEADILAITGPEDQPLLAVTRPFISSKFSQK,phphphhhhphhppphhhhhpphhhppphppp,32
"sp|Q9BXH1|BBC3_HUMAN Bcl-2-binding component 3, isoforms 1/2 OS=Homo sapiens OX=9606 GN=BBC3 PE=1 SV=1",sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1,170,187,LIGLISFGGFVAAKMME,46,63,APAAPTLLPAAYLCAPT,hhhhhphhhhhhhphhp,17
sp|Q13625|ASPP2_HUMAN Apoptosis-stimulating of p53 protein 2 OS=Homo sapiens OX=9606 GN=TP53BP2 PE=1 SV=2,sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1,241,257,KVGRRKQNRRWSMIGA,1084,1100,TIIHREDEDEIEWWWA,phhppppppphphhhh,16
sp|Q16611|BAK_HUMAN Bcl-2 homologous antagonist/killer OS=Homo sapiens OX=9606 GN=BAK1 PE=1 SV=1,sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1,245,261,RKQNRRWSMIGAGVTA,42,58,HQQEQEAEGVAAPADP,pppppphphhhhhhph,16
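The `alpha` lines in the output above are the shared hydrophobic-polar (`hp`) encoding of the query and match regions. As a minimal sketch of that encoding, the snippet below uses a residue grouping inferred from the alignments shown in this output (hydrophobic: A, F, G, I, L, M, P, V, W, Y; everything else polar). The function name `hp_encode` is hypothetical and this is not necessarily how kmerseek implements it.

```python
# Hydrophobic residues as inferred from the alignments in the search output;
# every other amino acid is treated as polar (an assumption of this sketch).
HYDROPHOBIC = set("AFGILMPVWY")


def hp_encode(protein):
    """Map each residue to 'h' (hydrophobic) or 'p' (polar)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in protein.upper())


# Query and match regions from the BBC3 hit in the output above
query = "LIGLISFGGFVAAKMME"  # ced-9, positions 170-187
match = "APAAPTLLPAAYLCAPT"  # BBC3, positions 46-63

print(hp_encode(query))
print(hp_encode(match))
print(hp_encode(query) == hp_encode(match))
```

The two regions share no obvious amino acid identity, yet they reduce to the same `hp` string, which is why they are reported as a match.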
If you like, you can also run everything with `make all`.
The file `kmerseek_search.csv` contains example output, which looks like this:
| query_name | query_md5 | match_name | containment | intersect_hashes | ksize | scaled | moltype | match_md5 | jaccard | max_containment | average_abund | median_abund | std_abund | query_containment_ani | match_containment_ani | average_containment_ani | max_containment_ani | n_weighted_found | total_weighted_hashes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sp\|P41958\|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1 | fe3714626e8180caf90f78091563aae6 | sp\|Q9BXH1\|BBC3_HUMAN Bcl-2-binding component 3, isoforms 1/2 OS=Homo sapiens OX=9606 GN=BBC3 PE=1 SV=1 | 0.04081632653 | 2 | 48 | 5 | hp | 1d49aa1205276b9ba0176c6680cacd6d | 0.0243902439 | 0.05714285714 | 1 | 1 | 0 | 0.935532846 | 0.9421138187 | 0.9388233324 | 0.9421138187 | 2 | 35 |
| sp\|P41958\|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1 | fe3714626e8180caf90f78091563aae6 | sp\|Q12982\|BNIP2_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 2 OS=Homo sapiens OX=9606 GN=BNIP2 PE=1 SV=1 | 0.04081632653 | 2 | 48 | 5 | hp | 7bbc6e2ea3a472034fc31321943032ee | 0.02040816327 | 0.04081632653 | 1 | 1 | 0 | 0.935532846 | 0.9347534561 | 0.935143151 | 0.935532846 | 2 | 51 |
| sp\|P41958\|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1 | fe3714626e8180caf90f78091563aae6 | sp\|Q13625\|ASPP2_HUMAN Apoptosis-stimulating of p53 protein 2 OS=Homo sapiens OX=9606 GN=TP53BP2 PE=1 SV=2 | 0.02040816327 | 1 | 48 | 5 | hp | 35da5dcf3561c6c0b0aaa34a118eabef | 0.003610108303 | 0.02040816327 | 1 | 1 | 0 | 0.9221202974 | 0.8929697781 | 0.9075450378 | 0.9221202974 | 1 | 230 |
| sp\|P41958\|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1 | fe3714626e8180caf90f78091563aae6 | sp\|Q9UK96\|FBX10_HUMAN F-box only protein 10 OS=Homo sapiens OX=9606 GN=FBXO10 PE=1 SV=3 | 0.0612244898 | 3 | 48 | 5 | hp | 97f5f83c6214d6792113785b96747383 | 0.01435406699 | 0.0612244898 | 1 | 1 | 0 | 0.9434689411 | 0.9201376139 | 0.9318032775 | 0.9434689411 | 3 | 164 |
| sp\|P41958\|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1 | fe3714626e8180caf90f78091563aae6 | sp\|Q16611\|BAK_HUMAN Bcl-2 homologous antagonist/killer OS=Homo sapiens OX=9606 GN=BAK1 PE=1 SV=1 | 0.02040816327 | 1 | 48 | 5 | hp | 1f59cdb10b02a7c6baff18b034518599 | 0.01111111111 | 0.02380952381 | 1 | 1 | 0 | 0.9221202974 | 0.9250864216 | 0.9236033595 | 0.9250864216 | 1 | 42 |
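
The `containment` column can be checked by hand against the search log. The sketch below assumes containment is the number of shared hashes (`intersect_hashes`) divided by the number of hashes in the query sketch, which the log reports as 49 for `ced9.fasta`. The values are copied from the table, and the short match labels and the `containment` helper are only for illustration.

```python
QUERY_HASHES = 49  # "total hashes in merged signature: 49" for ced9.fasta

# (match, intersect_hashes, reported containment), copied from the table
rows = [
    ("BBC3_HUMAN", 2, 0.04081632653),
    ("BNIP2_HUMAN", 2, 0.04081632653),
    ("ASPP2_HUMAN", 1, 0.02040816327),
    ("FBX10_HUMAN", 3, 0.0612244898),
    ("BAK_HUMAN", 1, 0.02040816327),
]


def containment(intersect_hashes, query_hashes=QUERY_HASHES):
    """Fraction of the query's hashes that also occur in the match."""
    return intersect_hashes / query_hashes


for name, shared, reported in rows:
    computed = containment(shared)
    print(f"{name}: {shared}/{QUERY_HASHES} = {computed:.10f} (reported {reported})")

# Rank matches from highest to lowest reported containment
ranked = sorted(rows, key=lambda row: row[2], reverse=True)
print("Top hit by containment:", ranked[0][0])
```

Under that assumption, each reported value agrees with `intersect_hashes / 49`, and FBX10 ranks first with 3 shared hashes.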