This is @galaxyproject's modified fork of the original SegAlign.
Precise genome aligner efficiently leveraging GPUs.
- Added advanced runner script allowing the usage of MIG and/or MPS for better GPU utilization
- Updated to compile with TBB (Threading Building Blocks) version 2020.2
- Fixed the --scoring option. It can now read and use the substitution matrix from a LASTZ Scoring File
- Added --num_threads option to limit the number of threads used
- Added --segment_size option to limit the maximum number of HSPs per segment file for CPU load balancing
- Cleaned up build files and addressed compiler warnings
For standalone installation use Conda: conda install conda-forge::kegalign
For standalone installation with additional tools use Bioconda: conda install bioconda::kegalign-full
For installation in Galaxy we currently use the wrappers richard-burhans:kegalign and richard-burhans:batched_lastz from the Main Tool Shed.
Try the tools at usegalaxy.org: kegalign, batched_lastz
- Script to create conda environment
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash
- Script to install development environment
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash -dev
source ./conda-env-dev.bash
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
The following dependencies are required by KegAlign:
- CMake >= 3.8
- oneAPI Threading Building Blocks (oneTBB) 2020.2
- Boost C++ Libraries >= 1.70
- LASTZ 1.04.22
- faToTwoBit (from UCSC Genome Browser source)
# install kegalign
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash
The first step is to convert the input sequences to .2bit, a compact randomly-accessible format. These .2bit files are used later as input to LASTZ.
# convert target (ref) and query to 2bit
mkdir work
faToTwoBit <(gzip -cdfq ./test-data/apple.fasta.gz) work/ref.2bit
faToTwoBit <(gzip -cdfq ./test-data/orange.fasta.gz) work/query.2bit
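To illustrate why a 2-bit encoding is both compact and randomly accessible, here is a minimal Python sketch. It is not the faToTwoBit implementation: it only packs four bases per byte using the UCSC base codes (T=0, C=1, A=2, G=3) and omits the file header, sequence index, and N/mask blocks that a real .2bit file contains. The function names `pack` and `base_at` are hypothetical.

```python
# Illustrative 2-bit packing only; a real .2bit file also stores a header,
# a sequence index, and N/mask blocks, none of which are modeled here.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASE = "TCAG"

def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        shift = 6 - 2 * (i % 4)  # first base of each byte uses the two highest bits
        out[i // 4] |= CODE[base] << shift
    return bytes(out)

def base_at(packed, i):
    """Read base i directly, without decoding anything before it."""
    shift = 6 - 2 * (i % 4)
    return BASE[(packed[i // 4] >> shift) & 0b11]

packed = pack("ACGTTGCA")
print(len(packed), base_at(packed, 5))
```

Eight bases fit in two bytes, and any position can be read with one index calculation, which is the property that makes the format convenient for extracting subsequences.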
The second step has two sub-steps. First, use KegAlign to generate a list of LASTZ commands to run. Second, adjust this list of LASTZ commands using our diagonal partitioning method. Two ways to complete this step are shown below.
- You can run the python scripts used by Galaxy to generate a keg (tarball) containing the LASTZ commands.
# generate LASTZ keg
python ./scripts/runner.py --diagonal-partition --format maf- --num-cpu 16 --num-gpu 1 --output-file data_package.tgz --output-type tarball --tool_directory ./scripts test-data/apple.fasta.gz test-data/orange.fasta.gz
python ./scripts/package_output.py --format_selector maf --tool_directory ./scripts
- You can run KegAlign followed by our diagonal partitioning python script to generate the list of LASTZ commands.
# command-line kegalign
kegalign test-data/apple.fasta.gz test-data/orange.fasta.gz work/ --num_gpu 1 --num_threads 16 > lastz-commands.txt
xargs -d "\n" -n 1 python ./scripts/diagonal_partition.py -1 < lastz-commands.txt > new-lastz-commands.txt
mv new-lastz-commands.txt lastz-commands.txt
The third step is to use LASTZ to compute the alignment.
If you generated a keg (tarball) with the first option of the second step, you can use the python script used by Galaxy to compute the alignment.
# run LASTZ keg
python ./scripts/run_lastz_tarball.py --input=data_package.tgz --output=apple_orange.maf --parallel=16
If you generated a list of LASTZ commands with the second option of the second step, you can compute the alignment in one of two ways.
This runs the LASTZ commands serially.
# run LASTZ commands
bash lastz-commands.txt
(echo "##maf version=1"; cat *.maf-) > apple_orange.maf
This runs the LASTZ commands using GNU parallel:
# run LASTZ commands
parallel --max-procs 16 < lastz-commands.txt
(echo "##maf version=1"; cat *.maf-) > apple_orange.maf
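Where GNU parallel is not available, the same idea (run each command line as its own process, with a cap on concurrency) can be approximated in Python. This is a rough sketch, not part of KegAlign; `run_commands` is a hypothetical helper, and the two demo commands are placeholders standing in for the lines of lastz-commands.txt.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_commands(commands, max_procs=16):
    """Run shell command lines concurrently; return exit codes in input order."""
    def run(cmd):
        # Each line is handed to the shell, as it would be when run from a script
        return subprocess.run(cmd, shell=True).returncode
    with ThreadPoolExecutor(max_workers=max_procs) as pool:
        return list(pool.map(run, commands))

# Placeholder commands; in practice read them from lastz-commands.txt, e.g.
#   commands = [line.strip() for line in open("lastz-commands.txt") if line.strip()]
codes = run_commands(["exit 0", "exit 3"], max_procs=2)
print(codes)
```

Collecting the exit codes makes it easy to spot a failed LASTZ invocation before assembling the final MAF file.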
# check output
diff apple_orange.maf <(gzip -cdfq ./test-data/apple_orange.maf.gz)
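Both variants above finish by writing the MAF header line and then appending every headerless `*.maf-` file. A minimal Python sketch of that assembly step follows; `merge_maf` is a hypothetical helper, the demo file contents are made-up placeholders rather than real alignments, and `sorted()` may order files slightly differently from the shell's locale-aware glob.

```python
import glob
import os
import tempfile

MAF_HEADER = "##maf version=1\n"

def merge_maf(pattern, out_path):
    """Write the MAF header, then append every file matching pattern."""
    with open(out_path, "w") as out:
        out.write(MAF_HEADER)
        for path in sorted(glob.glob(pattern)):
            with open(path) as part:
                out.write(part.read())

# Demo with placeholder contents in a temporary directory
workdir = tempfile.mkdtemp()
for name, body in (("a.maf-", "a score=1\n\n"), ("b.maf-", "a score=2\n\n")):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(body)

merged_path = os.path.join(workdir, "merged.maf")
merge_maf(os.path.join(workdir, "*.maf-"), merged_path)
with open(merged_path) as handle:
    merged = handle.read()
print(merged)
```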
GPU utilization can be increased by using MIG and/or MPS, making alignments up to 20% faster.
- Preparing inputs
The provided split_input.py script assigns individual chromosomes from the input genome to separate FASTA files (up to --max_chunks), each holding roughly --goal_bp base pairs; these files are then run in parallel on the same GPU(s). Because individual chromosomes are not split, --goal_bp should not be significantly smaller than the largest chromosome in the input file, otherwise the chunks will differ widely in size. A good --goal_bp for the human genome is 200 million base pairs.
mkdir query_split target_split
./scripts/mps-mig/split_input.py --input ./test-data/apple.fasta.gz --out query_split --to_2bit --goal_bp 20000000 --max_chunks 30
./scripts/mps-mig/split_input.py --input ./test-data/orange.fasta.gz --out target_split --to_2bit --goal_bp 20000000 --max_chunks 30
mkdir tmp
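The effect of --goal_bp and --max_chunks can be pictured with a small greedy binning sketch. This is an assumption about the general idea, not the algorithm split_input.py actually uses; `plan_chunks` and the chromosome lengths are hypothetical.

```python
def plan_chunks(chrom_lengths, goal_bp, max_chunks):
    """Group whole chromosomes into chunks of roughly goal_bp base pairs."""
    chunks = []  # each chunk is [total_bp, [chromosome names]]
    for name, length in sorted(chrom_lengths.items(), key=lambda kv: -kv[1]):
        fits = [c for c in chunks if c[0] + length <= goal_bp]
        if fits:
            target = min(fits, key=lambda c: c[0])    # emptiest chunk with room
        elif len(chunks) < max_chunks:
            target = [0, []]                          # open a new chunk
            chunks.append(target)
        else:
            target = min(chunks, key=lambda c: c[0])  # over budget: least-full chunk
        target[0] += length
        target[1].append(name)
    return chunks

# Hypothetical chromosome lengths, not taken from the test data
lengths = {"chr1": 90, "chr2": 60, "chr3": 50, "chr4": 40, "chr5": 10}
balanced = plan_chunks(lengths, goal_bp=100, max_chunks=30)
uneven = plan_chunks({"big": 90, "small": 10}, goal_bp=50, max_chunks=30)
print(balanced)
print(uneven)
```

In the second call the goal is smaller than the largest chromosome, so that chromosome still lands in one oversized chunk and the chunk sizes diverge, which is why a goal close to or above the largest chromosome is recommended.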
- Select GPU UUIDs to run on using
nvidia-smi -L
- Run on two GPUs with 4 MPS processes per GPU (replace [GPU-UUID#] with the UUIDs reported by the command above)
Each KegAlign instance, with default settings, uses around 12 to 16 GiB of GPU memory. The chosen GPUs or MIG instances should each have enough GPU memory to run the number of KegAlign instances defined by the --MPS parameter.
python ./scripts/mps-mig/run_mig.py [GPU-UUID1],[GPU-UUID2] --MPS 4 --target ./target_split --query ./query_split --tmp_dir ./tmp/ --mps_pipe_dir ./tmp/ --output ./apples_oranges.maf --num_threads 64
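A quick back-of-the-envelope check helps when choosing --MPS. The sketch below uses the upper end of the 12 to 16 GiB range quoted above as a conservative per-instance estimate; the GPU memory sizes are illustrative assumptions, and `max_instances` is a hypothetical helper rather than part of run_mig.py.

```python
PER_INSTANCE_GIB = 16  # conservative: upper end of the 12 to 16 GiB range

def max_instances(gpu_mem_gib, per_instance_gib=PER_INSTANCE_GIB):
    """Largest --MPS value that fits in one GPU or MIG instance."""
    return gpu_mem_gib // per_instance_gib

for mem in (40, 80):  # illustrative memory sizes in GiB
    print(f"{mem} GiB -> up to {max_instances(mem)} KegAlign instances")
```

Under these assumptions, --MPS 4 as used above would need a device with at least 64 GiB of memory.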
By default, the HOXD70 substitution scores are used (from Chiaromonte et al. 2002).
bad_score = X:-1000 # used for sub['X'][*] and sub[*]['X']
fill_score = -100 # used when sub[*][*] is not defined
gap_open_penalty = 400
gap_extend_penalty = 30
A C G T
A 91 -114 -31 -123
C -114 100 -125 -31
G -31 -125 100 -114
T -123 -31 -114 91
A custom matrix can be supplied through the --scoring parameter. A substitution matrix can be inferred from your data using another LASTZ-based tool (LASTZ_D: Infer substitution scores).
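To make the role of these values concrete, the following Python sketch scores an ungapped pair of equal-length sequences with the default matrix shown above. It is an illustration only, not KegAlign's or LASTZ's code: the helper names are hypothetical, gap penalties are left out, and the handling of X and undefined characters follows the comments in the scoring file.

```python
BASES = "ACGT"
HOXD70 = [
    [91, -114, -31, -123],   # A
    [-114, 100, -125, -31],  # C
    [-31, -125, 100, -114],  # G
    [-123, -31, -114, 91],   # T
]
BAD_SCORE = -1000   # any pair involving X
FILL_SCORE = -100   # any pair not defined in the matrix

def sub_score(a, b):
    """Substitution score for one aligned pair of characters."""
    if a == "X" or b == "X":
        return BAD_SCORE
    if a in BASES and b in BASES:
        return HOXD70[BASES.index(a)][BASES.index(b)]
    return FILL_SCORE

def ungapped_score(target, query):
    """Sum of substitution scores over an ungapped alignment."""
    if len(target) != len(query):
        raise ValueError("sequences must have equal length")
    return sum(sub_score(a, b) for a, b in zip(target, query))

print(ungapped_score("ACGT", "ACGT"))
print(ungapped_score("ACGT", "AGGT"))
```

Because the default matrix is symmetric, swapping target and query does not change the score; a custom matrix need not have that property.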
The default output is a MAF alignment file. Other formats can be selected with the --format parameter. See the LASTZ manual for a description of the available formats.
This project is released under the MIT License.
See the LICENSE file for details.
AB Gulhan, R Burhans, R Harris, M Kandemir, M Haeussler, A Nekrutenko. KegAlign: Optimizing pairwise alignments with diagonal partitioning. bioRxiv, 2024. doi: 10.1101/2024.09.02.610839