16S rRNA gene sequence curation and phylogenetic reference set creation
- Confirm availability of the libraries needed to compile dependencies. On Ubuntu:
  sudo apt-get install gfortran libopenblas-dev liblapack-dev
- Install Python >= 3.8 or a Python 3 virtual environment:
  % python3 -m venv bin-env
  % source bin-env/bin/activate
  % bin/bootstrap.sh
The deenurp executable should now be on your $PATH.
See required system libraries above.
First, install binary dependencies:
Python 3
pip, for installing python dependencies (http://www.pip-installer.org/)
Python packages:
- Run pip install PACKAGE for every PACKAGE listed in requirements.txt, e.g.
  cat requirements.txt | xargs -n 1 pip install
Infernal version 1.1 (http://infernal.janelia.org/)
pplacer suite (http://matsen.fhcrc.org/pplacer)
FastTree 2 (http://www.microbesonline.org/fasttree/#Install)
Optional (for filter-outliers and pairwise-distances):
- muscle (http://www.drive5.com/muscle/)
Finally, install:
python setup.py install
Deenurp can be run from a Docker image, which can be built locally from the Dockerfile
or pulled with:
docker pull nghoffman/deenurp:v0.3.0
Similarity-search based reference sequence selection
The deenurp package under the current directory provides subcommands,
accessed via the script deenurp.py, or the command deenurp if installed.
Subcommands fall into two general categories:
- Building a set of reference sequences for use in refpkg building
- Selecting sequences for a specific reference package
Removes outlier sequences from a reference database
Expands poorly-represented names in a sequence file by similarity search
Cluster reference sequences, first by tax-id at a specified rank
(default: species), then by similarity for unnamed sequences or
sequences not classified to the desired rank. Serves as input to
search-sequences.
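The first clustering step described above (grouping by tax-id at a chosen rank) can be sketched as follows. This is only an illustration, not the cluster-refs implementation: the record layout (a `seqname` key plus one key per rank) and the function name `group_by_rank` are assumptions made for the example, and the later similarity clustering of the leftover sequences is not shown.

```python
from collections import defaultdict


def group_by_rank(records, rank="species"):
    """Group sequence names by tax-id at `rank`.

    `records` is a hypothetical list of dicts with a "seqname" key and one
    key per rank holding a tax-id (or None/"" when unclassified there).
    Returns (named_groups, unplaced), where `unplaced` holds sequences that
    would still need similarity-based clustering.
    """
    named = defaultdict(list)
    unplaced = []
    for rec in records:
        tax_id = rec.get(rank)
        if tax_id:
            named[tax_id].append(rec["seqname"])
        else:
            unplaced.append(rec["seqname"])
    return dict(named), unplaced


# Toy input: three sequences classified to species, one that is not.
records = [
    {"seqname": "s1", "species": "1280"},
    {"seqname": "s2", "species": "1280"},
    {"seqname": "s3", "species": "1282"},
    {"seqname": "s4", "species": None},
]
named, unplaced = group_by_rank(records)
```

Here `named` holds one cluster per species tax-id, and `unplaced` lists the sequences that a similarity step would have to cluster.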
Builds a set of hierarchical reference packages.
Searches a set of sequences against a FASTA file containing possible reference sequences.
This subcommand searches sequences against a reference FASTA
file, saving the results and some metadata to a sqlite database for
use in select-references.
Given the output of search-sequences, select-references
attempts to find a good set of reference sequences.
For each reference cluster (see cluster-refs) that has at least a minimum
number of sequences with best hits to the cluster, selects a set number
of sequences to serve as references.
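The thresholding idea above can be sketched like this. It is a simplified illustration under stated assumptions, not the select-references algorithm: the names `select_refs`, `min_hits`, and `refs_per_cluster` are hypothetical, and ranking references inside a cluster by how many best hits they received is a stand-in chosen for the example.

```python
from collections import Counter, defaultdict


def select_refs(best_hits, cluster_of, min_hits=2, refs_per_cluster=2):
    """Pick up to `refs_per_cluster` references from each cluster that
    received at least `min_hits` best hits.

    best_hits:  iterable of reference ids, one per query sequence
    cluster_of: dict mapping reference id -> cluster id
    """
    hits_per_ref = Counter(best_hits)
    by_cluster = defaultdict(dict)
    for ref, n in hits_per_ref.items():
        by_cluster[cluster_of[ref]][ref] = n

    selected = {}
    for cluster, counts in by_cluster.items():
        if sum(counts.values()) < min_hits:
            continue  # too few query sequences hit this cluster
        # Assumption for this sketch: prefer the most-hit references,
        # breaking ties by id so the result is deterministic.
        ranked = sorted(counts, key=lambda r: (-counts[r], r))
        selected[cluster] = ranked[:refs_per_cluster]
    return selected


cluster_of = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
best_hits = ["r1", "r1", "r2", "r3", "r3", "r3", "r4"]
selected = select_refs(best_hits, cluster_of)
```

With this toy input, cluster B receives a single best hit and is skipped, while cluster A contributes its two most-hit references.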
Taxa that are the sole descendant of their parent can complicate taxonomic classification.
The fill-lonely subcommand finds some company for these lonely
taxa.
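To make "lonely" concrete, the sketch below finds taxa whose parent has exactly one child. It only illustrates the definition; how fill-lonely then chooses additional sequences is not shown. The `parent_of` mapping and the function name `lonely_taxa` are assumptions made for the example.

```python
from collections import Counter


def lonely_taxa(parent_of):
    """Return taxa that are the only child of their parent.

    parent_of: dict mapping each taxon to its parent taxon
    (the root has no entry as a key).
    """
    n_children = Counter(parent_of.values())
    return sorted(t for t, p in parent_of.items() if n_children[p] == 1)


# Toy taxonomy: genusA has two species, genusB has only one.
parent_of = {
    "sp1": "genusA",
    "sp2": "genusA",
    "sp3": "genusB",
    "genusA": "fam",
    "genusB": "fam",
}
lonely = lonely_taxa(parent_of)
```

In this toy taxonomy only `sp3` is lonely, because it is the sole species under `genusB`.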
Fetches sequences from a sequence file which match the taxtable for a reference set at a given rank. Useful for adding type strains.
Runs the tax2tree program on a reference package, updating the
seq_info file.
Sequences whose lineage changes are relabeled. The prior tax_id is
added to the seq_info file in the reference package.
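The relabeling bookkeeping described above might look like the following sketch. It assumes seq_info rows are dicts with `seqname` and `tax_id` keys; the column name `prior_tax_id` and the function name `relabel` are hypothetical and chosen only for illustration, not taken from the tool.

```python
def relabel(rows, new_tax_ids):
    """Return copies of `rows` with updated tax_ids.

    rows:        list of dicts with "seqname" and "tax_id" keys
    new_tax_ids: dict mapping seqname -> newly assigned tax_id
    When a sequence's tax_id changes, the old value is kept under the
    (hypothetical) "prior_tax_id" key.
    """
    updated = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        new = new_tax_ids.get(row["seqname"], row["tax_id"])
        if new != row["tax_id"]:
            row["prior_tax_id"] = row["tax_id"]
            row["tax_id"] = new
        updated.append(row)
    return updated


rows = [
    {"seqname": "s1", "tax_id": "1280"},
    {"seqname": "s2", "tax_id": "1282"},
]
relabeled = relabel(rows, {"s1": "1281", "s2": "1282"})
```

Only `s1` changes here, so only its row gains a record of the earlier tax_id.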