Skip to main content
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations bannerUC Berkeley

Algorithms for Next-Generation High-Throughput Sequencing Technologies

Abstract

Recent advances of DNA sequencing technologies are allowing

researchers to generate

immense amounts of data in a fast and cost effective fashion, enabling

genome-wide association study and population

genetic research which could not be done a decade ago.

There are quite numerous computational challenges arising from this

advancement, however. Examples include efficient algorithms for processing raw

data generated by sequencing instruments, algorithms for detecting and

correcting sequencing errors, algorithms for detecting genome

variations from sequence data, just to name a few. Because different

sequencing technologies can have drastically different

characteristics, these algorithms often need to be adapted in order

to produce most accurate results.

In this thesis, I will address a few of the aforementioned problems. First, I

will describe two model-based basecalling algorithms for the Illumina

sequencing platforms: BayesCall and naiveBayesCall. The novelty of BayesCall algorithm

is that it is fully unsupervised, requiring no

training data with known labels, and therefore it is

applicable to data without a reference sequence.

It also dramatically improves sequencing accuracies.

Built upon BayesCall algorithm, naiveBayesCall dramatically improves computational

efficiency by approximating the original model without sacrificing

accuracy. We will also show that improved basecall can have positive

effects on the downstream sequence analysis, such as the detection of

single nucleotide polymorphism and the assembly of novel genomes.

In the third chapter, an algorithm, called ECHO, for correcting short-read

sequencing errors will be described. The correction algorithm efficiently computes

all overlaps between sequencing reads and corrects errors

by using statistical models. Since it does not rely on reference

genomes, ECHO can also be applied to de novo sequencing.

Most other error correction algorithms require users to specify

key parameters, but the optimal values for these parameters are unknown to

users and can be hard to specify. Without key parameters being

optimized, the effectiveness of error correction algorithm could

sometimes be dramatically reduced.

Based on statistical models, ECHO optimizes these parameters

accordingly. We will show that ECHO can significantly reduce

sequence error rates and also facilitate downstream sequence

analysis. It is also demonstrated that ECHO can be extended to detect

heterozygousity from sequencing data.

These algorithms are developed in hopes to

make downstream analysis of sequence data easier and ultimately

facilitate genome researches.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View