GitHub - galaxyproject/KegAlign: fork of the GPU genome aligner

This is @galaxyproject's modified fork of the original SegAlign.

Overview

Precise genome aligner efficiently leveraging GPUs.

Changes from the original implementation

Added an advanced runner script that allows MIG and/or MPS to be used for better GPU utilization
Updated to compile with TBB (Threading Building Blocks) version 2020.2
Fixed the --scoring option. It can now read and use the substitution matrix from a LASTZ Scoring File
Added --num_threads option to limit the number of threads used
Added --segment_size option to limit the maximum number of HSPs per segment file, for CPU load balancing
Cleaned up build files and addressed compiler warnings

Installation

For standalone installation use Conda: conda install conda-forge::kegalign

For standalone installation with additional tools use Bioconda: conda install bioconda::kegalign-full

For installation in Galaxy we currently use the wrappers richard-burhans:kegalign and richard-burhans:batched_lastz from the Main Tool Shed. Try the tools at usegalaxy.org: kegalign, batched_lastz

Script to create conda environment

git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash

Script to install development environment

git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash -dev
source ./conda-env-dev.bash

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

Dependencies

The following dependencies are required by KegAlign:

CMake >= 3.8
oneAPI Threading Building Blocks (oneTBB) 2020.2
Boost C++ Libraries >= 1.70
LASTZ 1.04.22
faToTwoBit (from UCSC Genome Browser source)
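Before building, it can help to confirm that the command-line tools from this list are reachable. The following is a minimal sketch, not part of KegAlign: the executable names `cmake`, `lastz`, and `faToTwoBit` are assumptions about how the tools are installed, and the check covers neither version numbers nor the oneTBB and Boost libraries.

```python
import shutil

# Executable names are assumptions for illustration; adjust them to your setup.
REQUIRED_TOOLS = ["cmake", "lastz", "faToTwoBit"]


def find_missing(tools):
    """Return the entries of `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```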

Usage

Alignment

Computing a Sample Alignment

Installing KegAlign

# install kegalign
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash

Converting input sequences

The first step is to convert the input sequences to .2bit, a compact randomly-accessible format. These .2bit files are used later as input to LASTZ.

# convert target (ref) and query to 2bit
mkdir work
faToTwoBit <(gzip -cdfq ./test-data/apple.fasta.gz) work/ref.2bit
faToTwoBit <(gzip -cdfq ./test-data/orange.fasta.gz) work/query.2bit
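To illustrate why .2bit is compact and randomly accessible, here is a small sketch of two-bit base packing. It uses the base codes of the UCSC .2bit format (T=0, C=1, A=2, G=3, first base in the high bits of each byte), but it is only an illustration: real .2bit files also carry a header, a sequence index, and N and soft-mask blocks, none of which are handled here. The function names are hypothetical.

```python
# Base-to-code mapping used for packed DNA in the UCSC .2bit format.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq):
    """Pack an ACGT string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk by padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def base_at(packed, index):
    """Random access: fetch one base without unpacking the rest."""
    byte = packed[index // 4]
    shift = 2 * (3 - index % 4)
    return BASES[(byte >> shift) & 0b11]
```

Four bases fit in one byte, and any position can be read by computing a byte offset and a bit shift, which is what makes the format convenient as LASTZ input.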

Generating LASTZ commands

The second step has two sub-steps. First, use KegAlign to generate a list of LASTZ commands to run. Second, adjust this list of LASTZ commands using our diagonal partitioning method. Two ways to complete this step are shown below.

You can run the python scripts used by Galaxy to generate a keg (tarball) containing the LASTZ commands.

# generate LASTZ keg
python ./scripts/runner.py --diagonal-partition --format maf- --num-cpu 16 --num-gpu 1 --output-file data_package.tgz --output-type tarball --tool_directory ./scripts test-data/apple.fasta.gz test-data/orange.fasta.gz
python ./scripts/package_output.py --format_selector maf --tool_directory ./scripts

You can run KegAlign followed by our diagonal partitioning python script to generate the list of LASTZ commands.

# command-line kegalign
kegalign test-data/apple.fasta.gz test-data/orange.fasta.gz work/ --num_gpu 1 --num_threads 16 > lastz-commands.txt
xargs -d "\n" -n 1 python ./scripts/diagonal_partition.py -1 < lastz-commands.txt > new-lastz-commands.txt
mv new-lastz-commands.txt lastz-commands.txt

Computing the alignment

The third step is to use LASTZ to compute the alignment.

If you have a keg (tarball) from the first option in the previous step, you can use the Python script used by Galaxy to compute the alignment.

# run LASTZ keg
python ./scripts/run_lastz_tarball.py --input=data_package.tgz --output=apple_orange.maf --parallel=16

If you have a list of LASTZ commands from the second option in the previous step, you can run them directly to compute the alignment.

This runs the LASTZ commands serially:

# run LASTZ commands
bash lastz-commands.txt
(echo "##maf version=1"; cat *.maf-) > apple_orange.maf
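The final merge step can also be written in Python. This is a minimal sketch of the same idea as the `echo`/`cat` line: write the MAF header once, then append every `*.maf-` piece. The function name is hypothetical, and `sorted()` orders file names by code point, which may differ from your shell's glob order under some locales.

```python
import glob
import os


def merge_maf_pieces(directory, output_path, header="##maf version=1"):
    """Write the MAF header, then append every *.maf- file found in `directory`."""
    pieces = sorted(glob.glob(os.path.join(directory, "*.maf-")))
    with open(output_path, "w") as out:
        out.write(header + "\n")
        for piece in pieces:
            with open(piece) as handle:
                out.write(handle.read())
    return len(pieces)
```

For example, `merge_maf_pieces(".", "apple_orange.maf")` would merge the pieces in the current directory and return how many were found.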

This runs the LASTZ commands using GNU parallel:

# run LASTZ commands
parallel --max-procs 16 < lastz-commands.txt
(echo "##maf version=1"; cat *.maf-) > apple_orange.maf
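If GNU parallel is not available, a bounded worker pool gives similar behavior. The sketch below is an illustration, not part of KegAlign: it runs each line of a command file through the shell with at most `max_procs` running at once. `shell=True` is used on the assumption that the generated lines may contain shell syntax such as output redirection; the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Read one command per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_commands(commands, max_procs=16):
    """Run each shell command line, at most `max_procs` at a time.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(command):
        return subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_procs) as pool:
        return list(pool.map(run_one, commands))
```

A call such as `run_commands(read_commands("lastz-commands.txt"), max_procs=16)` returns one exit code per command, so a non-zero entry points to the line that failed.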

Checking the output

# check output
diff apple_orange.maf <(gzip -cdfq ./test-data/apple_orange.maf.gz)
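The same check can be done without `diff`. This sketch, with a hypothetical function name, compares the result byte for byte against the gzip-compressed reference. Unlike `gzip -cdfq`, it expects the reference to be compressed and does not print the differing lines.

```python
import gzip


def matches_reference(result_path, reference_gz_path):
    """Return True when the plain-text result equals the gzip-compressed reference."""
    with open(result_path, "rb") as result, gzip.open(reference_gz_path, "rb") as reference:
        return result.read() == reference.read()
```

For the sample data this would be called as `matches_reference("apple_orange.maf", "./test-data/apple_orange.maf.gz")`.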

Running with MIG/MPS

GPU utilization can be increased by using MIG and/or MPS, leading to alignments that are up to 20% faster.

Preparing inputs

The provided split_input.py script assigns individual chromosomes from the input genome to separate FASTA files (up to --max_chunks of them), each holding roughly --goal_bp base pairs; these files are then run in parallel on the same GPU(s). Because individual chromosomes are never split, --goal_bp should not be significantly smaller than the largest chromosome in the input file, otherwise the chunks will differ widely in size. A good --goal_bp value for the human genome is 200 million base pairs.

mkdir query_split target_split
./scripts/mps-mig/split_input.py --input ./test-data/apple.fasta.gz --out query_split --to_2bit --goal_bp 20000000 --max_chunks 30
./scripts/mps-mig/split_input.py --input ./test-data/orange.fasta.gz --out target_split --to_2bit --goal_bp 20000000 --max_chunks 30
mkdir tmp
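To show how --goal_bp and --max_chunks interact, here is a simplified greedy grouping sketch. It is an assumption-laden illustration and not the algorithm used by split_input.py: chromosomes are taken in input order, a new chunk starts when the next chromosome would push the current one past the goal, and once the chunk limit is reached everything else goes into the last chunk.

```python
def plan_chunks(chrom_lengths, goal_bp, max_chunks):
    """Group (name, length) pairs into at most `max_chunks` chunks of roughly `goal_bp` bases.

    Chromosomes are never split, so a chromosome longer than goal_bp
    makes its chunk exceed the goal.
    """
    chunks = [[]]
    current_bp = 0
    for name, length in chrom_lengths:
        would_overflow = current_bp + length > goal_bp
        # Open a new chunk only if the current one has content and the limit allows it.
        if chunks[-1] and would_overflow and len(chunks) < max_chunks:
            chunks.append([])
            current_bp = 0
        chunks[-1].append(name)
        current_bp += length
    return [chunk for chunk in chunks if chunk]
```

With a goal well below the largest chromosome, the large chromosomes each occupy an oversized chunk while the small ones share chunks near the goal, which is the imbalance the paragraph above warns about.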

List the available GPUs and select the UUIDs to run on:

nvidia-smi -L
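The UUIDs can be pulled out of the listing programmatically. The sketch below parses text in the `nvidia-smi -L` style and joins the UUIDs with commas. The sample listing is made up for illustration, the function names are hypothetical, and the comma-separated form without brackets is an assumption about how the `[GPU-UUID#]` placeholders in the run_mig.py command are meant to be filled in.

```python
import re

# Illustrative listing only; the device names and UUIDs below are made up.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-66666666-7777-8888-9999-000000000000)
"""


def extract_uuids(listing):
    """Return every UUID found in `nvidia-smi -L` style text, in order of appearance."""
    return re.findall(r"\(UUID:\s*([^)\s]+)\)", listing)


def uuid_argument(uuids):
    """Join UUIDs with commas for use as a single command-line argument."""
    return ",".join(uuids)
```

On a machine with the NVIDIA driver installed, the listing could be obtained with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout` and passed to `extract_uuids`.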

Run on two GPUs with 4 MPS processes per GPU (replace each [GPU-UUID#] placeholder with a UUID from the output of the command above).

Each KegAlign instance, with default settings, uses around 12 to 16 GiB of GPU memory. The chosen GPUs or MIG instances should each have enough GPU memory to run the number of KegAlign instances defined by the --MPS parameter.

python ./scripts/mps-mig/run_mig.py [GPU-UUID1],[GPU-UUID2] --MPS 4 --target ./target_split --query ./query_split --tmp_dir ./tmp/ --mps_pipe_dir ./tmp/ --output ./apples_oranges.maf --num_threads 64
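The memory guidance above translates into a quick upper bound for --MPS. This sketch is a back-of-the-envelope estimate only: it uses the upper end of the 12 to 16 GiB range quoted above as a conservative default, the function name is hypothetical, and real usage depends on your settings and inputs.

```python
def max_instances_per_gpu(gpu_memory_gib, per_instance_gib=16):
    """Upper bound on concurrent KegAlign instances that fit in one GPU's memory."""
    if per_instance_gib <= 0:
        raise ValueError("per_instance_gib must be positive")
    return int(gpu_memory_gib // per_instance_gib)
```

Under this estimate, running with --MPS 4 needs about 4 × 16 = 64 GiB per GPU or MIG instance, so a device with 40 GiB would be limited to 2 instances at the conservative figure.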

Scoring Options

By default, the HOXD70 substitution scores are used (from Chiaromonte et al. 2002):

bad_score          = X:-1000  # used for sub['X'][*] and sub[*]['X']
fill_score         = -100     # used when sub[*][*] is not defined
gap_open_penalty   =  400
gap_extend_penalty =   30

     A     C     G     T
A   91  -114   -31  -123
C -114   100  -125   -31
G  -31  -125   100  -114
T -123   -31  -114    91

A custom matrix can be supplied through the --scoring parameter. A substitution matrix can be inferred from your data using another LASTZ-based tool (LASTZ_D: Infer substitution scores).
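To make the matrix concrete, the following sketch scores an ungapped pairing of two equal-length sequences with the HOXD70 values listed above. It is an illustration of how substitution scores add up, not KegAlign's or LASTZ's scoring code: gap penalties, the bad_score and fill_score settings, and non-ACGT characters are deliberately left out, and the function name is hypothetical.

```python
# HOXD70 substitution scores as listed above (rows and columns in A, C, G, T order).
BASES = "ACGT"
HOXD70 = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]


def ungapped_score(a, b, matrix=HOXD70):
    """Sum the substitution scores of two equal-length ACGT strings, column by column."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[BASES.index(x)][BASES.index(y)] for x, y in zip(a, b))
```

For instance, pairing `ACGT` with itself gives 91 + 100 + 100 + 91 = 382, while changing the second base to `G` replaces a 100 with -125 and lowers the total to 157.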

Output Options

The default output is a MAF alignment file. Other formats can be selected with the --format parameter. See the LASTZ manual for a description of the available formats.

License

This project is released under the MIT License.
See the LICENSE file for details.

Citing KegAlign

AB Gulhan, R Burhans, R Harris, M Kandemir, M Haeussler, A Nekrutenko. KegAlign: Optimizing pairwise alignments with diagonal partitioning. bioRxiv, 2024. doi: 10.1101/2024.09.02.610839
