Recent advances of DNA sequencing technologies are allowing
researchers to generate
immense amounts of data in a fast and cost effective fashion, enabling
genome-wide association study and population
genetic research which could not be done a decade ago.
There are quite numerous computational challenges arising from this
advancement, however. Examples include efficient algorithms for processing raw
data generated by sequencing instruments, algorithms for detecting and
correcting sequencing errors, algorithms for detecting genome
variations from sequence data, just to name a few. Because different
sequencing technologies can have drastically different
characteristics, these algorithms often need to be adapted in order
to produce most accurate results.
In this thesis, I will address a few of the aforementioned problems. First, I
will describe two model-based basecalling algorithms for the Illumina
sequencing platforms: BayesCall and naiveBayesCall. The novelty of BayesCall algorithm
is that it is fully unsupervised, requiring no
training data with known labels, and therefore it is
applicable to data without a reference sequence.
It also dramatically improves sequencing accuracies.
Built upon BayesCall algorithm, naiveBayesCall dramatically improves computational
efficiency by approximating the original model without sacrificing
accuracy. We will also show that improved basecall can have positive
effects on the downstream sequence analysis, such as the detection of
single nucleotide polymorphism and the assembly of novel genomes.
In the third chapter, an algorithm, called ECHO, for correcting short-read
sequencing errors will be described. The correction algorithm efficiently computes
all overlaps between sequencing reads and corrects errors
by using statistical models. Since it does not rely on reference
genomes, ECHO can also be applied to de novo sequencing.
Most other error correction algorithms require users to specify
key parameters, but the optimal values for these parameters are unknown to
users and can be hard to specify. Without key parameters being
optimized, the effectiveness of error correction algorithm could
sometimes be dramatically reduced.
Based on statistical models, ECHO optimizes these parameters
accordingly. We will show that ECHO can significantly reduce
sequence error rates and also facilitate downstream sequence
analysis. It is also demonstrated that ECHO can be extended to detect
heterozygousity from sequencing data.
These algorithms are developed in hopes to
make downstream analysis of sequence data easier and ultimately
facilitate genome researches.