Open Access Publications from the University of California

Evaluation And Application Of The Three A's Of Genomics: Assembly, Alignment, Annotation.

  • Author(s): Earl, Dent A.
  • Advisor(s): Haussler, David
  • et al.
Creative Commons 'BY' version 4.0 license

In the fourteen years since the announcement of the rough draft of the human genome there has been a precipitous drop in the cost of genome sequencing, from several billions of dollars down to a few thousand. This has led to a new age of data prosperity in the biological sciences and to a large growth in the field of bioinformatics. The extremely low cost of sequencing means that multitudes of species can now be sequenced and those data can be added to the ever-expanding library. This sudden wealth of raw data at a low level in the analysis hierarchy has given an urgency to important methodological questions. What is the best method to assemble a new genome? Given several assembled genomes and a desire to know how they are related to one another, what is the best way to align them? And finally, once in possession of an alignment of genomes, how can we best use that information to annotate genes and other biologically interesting regions upon those genomes?

The work presented herein is an attempt to quantify and compare the methods for carrying out these fundamental research tasks. In the first chapter (on the Assemblathon, a collaborative competition that brought together the world's genome assembly experts), I investigate the problem of short-read whole-genome de novo assembly. In the second chapter (on the Alignathon, a collaborative competition that brought together the world's whole-genome alignment experts), I investigate the problem of whole-genome alignment. Finally, I address the problem of comparative annotation by using a dataset comprising the genomes of 17 strains of inbred mouse and the rat in order to establish a quick method of annotating multiple genomes using a combination of evolutionary homology information, existing annotations, and extrinsic data (e.g. RNAseq), using off-the-shelf software components for the majority of the work.
