PacBio offers a suite of tools for analyzing third-generation sequencing data, including reference-based alignment, de novo assembly, hybrid de novo assembly, and consensus calling with variant detection. For each of these applications, we have created algorithms that match the speed inherent in SMRT sequencing and take advantage of PacBio's long reads.
BLASR
BLASR (Basic Local Alignment with Successive Refinement) maps reads to genomes by finding the highest-scoring local alignment, or set of local alignments, between the read and the genome. The initial set of candidate alignments is found by querying a pre-computed index of the reference genome that can be searched rapidly; the candidates are then refined until only high-scoring alignments are retained. The base assignment in each alignment is optimized and scored using all available quality information, such as insertion and deletion quality values. Because the alignment approximates an exhaustive search, alignment significance may be computed by comparing the optimal alignment score to the distribution of all other significant alignment scores.
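The seed-then-refine idea described above can be illustrated with a small Python sketch. This is not BLASR's implementation: the k-mer index, the diagonal voting, the Smith-Waterman scoring values, and every function name (build_index, candidate_offsets, local_score, map_read) are assumptions chosen for brevity, and quality values are ignored entirely.

```python
from collections import Counter, defaultdict

def build_index(ref, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def candidate_offsets(read, index, k, top=3):
    """Vote for the reference offsets (diagonals) supported by the most k-mer hits."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    return [offset for offset, _ in votes.most_common(top)]

def local_score(a, b, match=2, mismatch=-3, gap=-2):
    """Smith-Waterman local alignment score (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def map_read(read, ref, k=4, pad=5):
    """Return (score, window_start) for the best refined candidate, or None."""
    index = build_index(ref, k)
    best = None
    for offset in candidate_offsets(read, index, k):
        # Refine each candidate by aligning the read to a padded reference window.
        start = max(0, offset - pad)
        end = min(len(ref), offset + len(read) + pad)
        score = local_score(read, ref[start:end])
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

In this sketch the index lookup narrows the search to a few reference windows, and only those windows pay the cost of dynamic programming. A fuller version would keep several high-scoring candidates and compare the best score against the rest to estimate significance.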
ALLORA
ALLORA (A Long Read Assembler) is the PacBio de novo assembly algorithm. It is based on the open-source assembly software package AMOS, with additional software components tailored to PacBio's long reads and error profile. ALLORA uses a traditional overlap-layout-consensus approach to iteratively assemble PacBio raw reads into contigs, which it outputs as FASTA sequence and cmp.h5 files.
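As a rough illustration of the overlap and layout steps, the following Python sketch greedily merges the pair of sequences with the longest exact suffix-prefix overlap until no overlaps remain. It is a toy under strong assumptions (error-free reads, exact overlaps, no consensus step, a single strand) and does not reflect ALLORA or AMOS internals; the function names and the min_len parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(reads, min_len=3):
    """Greedily merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        # Layout: join the two sequences, keeping the shared overlap only once.
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

For example, assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]) joins the three reads into the single contig "ATGGCGTACGTTA". Real long reads carry insertion and deletion errors, so a practical assembler replaces the exact overlap test with error-tolerant alignment and adds a consensus step over the layout.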
EviCons
EviCons (Evidence-based Consensus) produces the consensus sequence from a multiple sequence alignment corresponding to mapped reads (resequencing) or a contig (de novo). Using empirical conditional probabilities and a likelihood ratio test, EviCons partitions the multiple sequence alignment into regions of certainty and regions of uncertainty. For regions of uncertainty, EviCons uses base quality values and the Steiner framework to produce the best estimate of the local consensus sequence.
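The certain/uncertain split can be sketched in Python with a per-column log-likelihood ratio between the best and second-best consensus symbol. This is a simplification, not EviCons itself: it assumes a single uniform error rate instead of empirical conditional probabilities, treats each column independently, and does not attempt the Steiner-based resolution of uncertain regions. The function names, the error rate, and the threshold are all illustrative assumptions.

```python
import math
from collections import Counter

def column_llr(column, error=0.15):
    """Return (best_symbol, log-likelihood ratio of the best vs. second-best symbol)."""
    counts = Counter(column)

    def loglik(symbol):
        # An observation matches the true symbol with probability 1 - error;
        # otherwise it is one of the 4 other symbols (3 bases or a gap).
        return sum(n * math.log(1 - error if obs == symbol else error / 4)
                   for obs, n in counts.items())

    ranked = sorted("ACGT-", key=loglik, reverse=True)
    return ranked[0], loglik(ranked[0]) - loglik(ranked[1])

def consensus(msa, threshold=5.0):
    """Return the gapped consensus string and a parallel list of 'certain' flags."""
    calls = [column_llr(col) for col in zip(*msa)]
    seq = "".join(symbol for symbol, _ in calls)
    certain = [llr >= threshold for _, llr in calls]
    return seq, certain
```

With four aligned rows, a column that reads G, G, C, C has a ratio of zero and is flagged uncertain, while a unanimous column is flagged certain. Consecutive uncertain columns would form the regions that need a more careful local estimate.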
RCCS
The RCCS (Reference Circular Consensus Sequencing) module is designed to call SNP and small indel variants against a reference sequence from the circular subreads of each single molecule. It uses a probabilistic alignment algorithm that tests all possible single-base variants at a given location and selects the most likely one using the likelihood calculated with the alignment model.
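A minimal Python sketch of this variant-testing idea follows. It scores each candidate template (the reference, every substitution, every single-base insertion, and the deletion at one position) by summing, over subreads, the log-likelihood from a simple forward-style alignment recursion. The alignment model here is an assumption: fixed substitution, insertion, and deletion probabilities with no quality values, which is far simpler than whatever model RCCS uses. All names and parameter values are hypothetical.

```python
import math

def read_likelihood(template, read, p_sub=0.05, p_ins=0.05, p_del=0.05):
    """Likelihood of a read given a template, summed over all alignments."""
    p_match = 1.0 - p_ins - p_del
    # Row for the empty template: the read prefix consists of insertions only.
    prev = [(p_ins * 0.25) ** j for j in range(len(read) + 1)]
    for t in template:
        cur = [prev[0] * p_del]
        for j, r in enumerate(read, start=1):
            emit = 1 - p_sub if r == t else p_sub / 3
            cur.append(prev[j - 1] * p_match * emit    # aligned pair
                       + prev[j] * p_del               # template base skipped
                       + cur[j - 1] * p_ins * 0.25)    # extra read base
        prev = cur
    return prev[-1]

def single_base_variants(ref, pos):
    """Yield (label, template) for the reference and each single-base change at pos."""
    yield ("ref", ref)
    for b in "ACGT":
        if b != ref[pos]:
            yield (f"sub {ref[pos]}>{b}", ref[:pos] + b + ref[pos + 1:])
        yield (f"ins {b}", ref[:pos] + b + ref[pos:])
    yield ("del", ref[:pos] + ref[pos + 1:])

def call_variant(ref, subreads, pos):
    """Return the label of the candidate with the highest total log-likelihood."""
    def total(template):
        return sum(math.log(read_likelihood(template, r)) for r in subreads)
    return max(single_base_variants(ref, pos), key=lambda v: total(v[1]))[0]
```

Because every subread comes from the same molecule, their log-likelihoods are summed, so a variant seen consistently across passes outweighs a random error in any single pass. For long sequences the recursion should be carried out in log space to avoid underflow.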
AHA
The PacBio hybrid assembly pipeline AHA (A Hybrid Assembler) combines PacBio reads with data from other high-confidence sources. These data can be assembled contigs from other next-generation sequencing technologies or even Sanger contigs. The PacBio data can be standard long reads or strobe reads. The pipeline uses the PacBio reads to orient the high-confidence contigs, joining them into larger contigs or scaffolds.
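The orienting and ordering step can be pictured with the Python sketch below, in which a single long read that spans several contigs determines their order and strand. It is only an illustration under simplifying assumptions: anchoring uses exact k-mer matches instead of error-tolerant alignment, one long read is considered at a time, and strobe reads are not modeled. The function names and the k parameter are hypothetical and do not come from AHA.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def anchor(long_read, oriented, k):
    """Implied contig start in read coordinates from its first shared k-mer, or None."""
    for i in range(len(oriented) - k + 1):
        pos = long_read.find(oriented[i:i + k])
        if pos != -1:
            return pos - i
    return None

def scaffold(long_read, contigs, k=6):
    """Return [(name, strand)] for anchored contigs, ordered along the long read."""
    placements = []
    for name, seq in contigs.items():
        # Try the contig as given, then its reverse complement.
        for strand, oriented in (("+", seq), ("-", revcomp(seq))):
            start = anchor(long_read, oriented, k)
            if start is not None:
                placements.append((start, name, strand))
                break
    return [(name, strand) for _, name, strand in sorted(placements)]
```

A contig with no anchor in the read is simply left out of that read's ordering. Combining the orderings from many long reads, and estimating the gap sizes between neighboring contigs, would be the next steps toward a scaffold.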
More information and video explanations of each algorithm will be made available on DevNet.