Scallop2 is a reference-based transcript assembler specifically optimized for paired-/multiple-end RNA-seq data. The development of Scallop2 has been based on the Scallop assembler. Scripts and documentation that evaluate Scallop2 is available at scallop2-test.
Scallop2 can be easily installed with conda:
Install from source code: download the source code of Scallop2 from
here.
Scallop2 uses additional libraries of Boost and htslib.
If they have not been installed in your system, you first
need to download and install them. You might also need to
export the runtime library path to certain environmental
variable (for example, LD_LIBRARY_PATH, for most linux distributions).
After install these dependencies, you then compile the source code of Scallop2.
If some of the above dependencies are not installed to the default system
directories (for example, /usr/local, for most linux distributions),
their corresponding installing paths should be specified to configure of Scallop2.
The entire installation typically takes a few minutes to complete.
If Boost has not been downloaded/installed, download Boost (license) from (http://www.boost.org). Uncompress it somewhere (compiling and installing are not necessary).
If htslib has not been installed, download htslib (license) from (http://www.htslib.org/) with version 1.5 or higher. Note that htslib relies on zlib. So if zlib has not been installed in your system, you need to install zlib first. To do so, download zlib (license) at (https://zlib.net/). Use the following commands to install zlib:
./configure
make
make install
After installing zlib, use the following commands to build htslib:
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no
make
make install
The default installation location of htslib is /usr/lib.
If you would install it to a different location, replace the above configure line with
the following (by adding --prefix=/path/to/your/htslib to the end):
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no --prefix=/path/to/your/htslib
In this case, you also need to export the runtime library path (note that there
is an additional lib following the installation path):
export LD_LIBRARY_PATH=/path/to/your/htslib/lib:$LD_LIBRARY_PATH
Use the following to compile Scallop2:
./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost
make
If some of the dependencies are installed in the default system directory (for example, /usr/lib),
then the corresponding --with- option might not be necessary.
The executable file scallop2 will appear at src/scallop2.
The usage of scallop2 is:
./scallop2 -i <input.bam> -o <output.gtf> [options]
The input.bam is the read alignment file generated by some RNA-seq aligner, (for example, STAR or HISAT2).
Make sure that it is sorted; otherwise run samtools to sort it:
samtools sort input.bam > input.sort.bam
The reconstructed transcripts shall be written as gtf format into output.gtf.
Scallop2 support the following parameters. Please refer to additional explanations below the table.
| Parameters | Default Value | Description |
|---|---|---|
| --help | print usage of Scallop2 and exit | |
| --version | print version of Scallop2 and exit | |
| --preview | show the inferred library_type and exit |
|
| --verbose | 1 | chosen from {0, 1, 2} |
| -f/--transcript_fragments | file to which the assembled non-full-length transcripts will be written to | |
| --library_type | empty | chosen from {empty, unstranded, first, second} |
| --assemble_duplicates | 10 | the number of consensus runs of the decomposition |
| --min_transcript_coverage | 1.5 | the minimum coverage required to output a multi-exon transcript |
| --min_single_exon_coverage | 20 | the minimum coverage required to output a single-exon transcript |
| --min_transcript_length_base | 150 | the minimum base length of a transcript |
| --min_transcript_length_increase | 50 | the minimum increased length of a transcript with each additional exon |
| --min_mapping_quality | 1 | ignore reads with mapping quality less than this value |
| --max_num_cigar | 1000 | ignore reads with CIGAR size larger than this value |
| --min_bundle_gap | 100 | the minimum distances required to start a new bundle |
| --min_num_hits_in_bundle | 5 | the minimum number of reads required in a bundle |
| --min_flank_length | 3 | the minimum match length required in each side for a spliced read |
| --min_splice_bundary_hits | 1 | the minimum number of spliced reads required to support a junction |
-
For
--verbose, 0: quiet; 1: one line for each splice graph; 2: details of graph decomposition. -
--library_typeis highly recommended to provide. Theunstranded,first, andsecondcorrespond tofr-unstranded,fr-firststrand, andfr-secondstrandused in standard Illumina sequencing libraries. If none of them is given, i.e., it isemptyby default, then Scallop2 will try to infer thelibrary_typeby itself (see--preview). Notice that such inference is based on theXStag stored in the inputbamfile. If the inputbamfile do not containXStag, then it is essential to provide thelibrary_typeto Scallop2. You can try--previewto see the inferredlibrary_type. -
--min_transcript_coverageis used to filter lowly expressed transcripts: Scallop2 will filter out transcripts whose (predicted) raw counts (number of moleculars) is less than this number. -
--min_transcript_length_baseand--min_transcript_length_increaseis combined to filter short transcripts: the minimum length of a transcript is given by--min_transcript_length_base+--min_transcript_length_increase* num-of-exons-in-this-transcript. Transcripts that are less than this number will be filtered out.
The alignments of a set of 10 Illumina paired-end RNA-seq samples are available at doi:10.26208/8c06-w247. Scallop2 usually take a few minteus to an hour to assemble a typical aligned Illumina RNA-seq sample.
A small example of input data example-input.bam is available in the example directory.
Suppose we have installed Scallop2 following the steps in the Installation section, we have the executable file scallop2 at src/scallop2.
Commands to enter example directory and run Scallop2 using example-input.bam as input:
cd ./example
../src/scallop2 -i example-input.bam -o example-output.gtf
An output file named example-output.gtf will appear in the example directory.
The output file stores the reconstructed transcripts assembled by Scallop2 in GTF format.
The Gene transfer format (GTF) is a file format used to hold information about gene structure, detailed documentation about GTF format is available from Ensembl.
We can use IGV to visualize the input reads (BAM file) and the output assembled transcripts (GTF file).
A snapshot of IGV visualization of example-input.bam and example-output.gtf at gene loci chr1:46,303,377-46,321,251 is avaliable at example/example-IGV-snapshot.png.