eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Statistical and Algorithmic Methods to Analyze Genome Sequencing Data

Abstract

With continuing reductions in the cost of genome sequencing and the advent of new sequencing technologies, sequencing genomes has become routine in experiments across different fields of biology that study the foundations of life. Collecting multi-omics data is now an integral part of studying human health and the underlying genetic causes of disease. Genome sequencing is used extensively to study the evolution of life and how different species are genetically related, and it is becoming an important tool for monitoring the health of ecosystems and studying the dynamics of biodiversity in this era of rapid climate change. As large genomic datasets become available through worldwide consortia and collaborative efforts, processing and interpreting them has become an important challenge. In this dissertation, I present a collection of statistical and algorithmic methods that address different computational problems faced in using genome sequencing data to study the function and properties of the genome and its variation across species. In the first part of this dissertation, I describe the method we developed to assess the statistical significance of overlap between genome annotations (the assignment of function to specific genomic regions, which is a foundational effort of modern biology). To the best of our knowledge, p-value computation for sets of overlapping intervals has been limited either to permutation tests, which do not scale to the computation of small p-values, or to simple parametric tests, such as hypergeometric or binomial tests, which rely on simplifying assumptions about the length and structure of the intervals. Our method instead formulates a null model in which the sizes of the intervals and their relative arrangement are taken into account when the significance of overlap is evaluated.
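To make the scaling limitation of permutation tests concrete, the following is a minimal sketch of such a baseline test, not the method developed in the dissertation. It assumes half-open, non-overlapping intervals on a single linear genome and a simple shuffling scheme that preserves interval lengths; the function names (`total_overlap`, `shuffle_intervals`, `permutation_p_value`) are hypothetical and chosen only for illustration.

```python
import random


def total_overlap(a, b):
    """Bases shared by two sorted lists of disjoint half-open intervals."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def shuffle_intervals(intervals, genome_length, rng):
    """Re-place intervals of the same lengths at random, without overlaps.

    Assumption for this sketch: positions are drawn by picking random
    offsets in the free (uncovered) space and sliding intervals into them.
    """
    lengths = [end - start for start, end in intervals]
    rng.shuffle(lengths)
    free = genome_length - sum(lengths)
    cuts = sorted(rng.randint(0, free) for _ in lengths)
    placed, used = [], 0
    for cut, length in zip(cuts, lengths):
        start = cut + used
        placed.append((start, start + length))
        used += length
    return placed


def permutation_p_value(a, b, genome_length, n_perm=999, seed=0):
    """Empirical p-value for observing at least this much overlap."""
    rng = random.Random(seed)
    observed = total_overlap(a, b)
    hits = sum(
        total_overlap(shuffle_intervals(a, genome_length, rng), b) >= observed
        for _ in range(n_perm)
    )
    # The +1 terms count the observed arrangement itself.
    return (hits + 1) / (n_perm + 1)
```

With this estimator the smallest reportable p-value is 1/(n_perm + 1), so resolving very small p-values requires a correspondingly large number of permutations, which is the scaling problem noted above.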
In the second part, I introduce the idea of using low-coverage whole-genome sequencing reads (genome skims) without requiring any genome assembly or alignment. We have developed methods to compute genomic distances between genome skims, for use in sample identification and phylogenetic placement, and to estimate genomic parameters such as genome length and repeat content, laying the foundation for accurate assessment of genetic biodiversity.
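As a rough illustration of what an assembly-free and alignment-free comparison can look like, the sketch below computes a k-mer Jaccard index directly from two sets of reads and converts it to a distance using the Mash-style transform. This is a generic technique, not the dissertation's estimator: it assumes error-free reads and ignores the effect of low coverage on the observed k-mer sets. All function names are hypothetical.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_set(reads, k):
    """Collect canonical k-mers from reads, skipping k-mers with non-ACGT bases."""
    kmers = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                kmers.add(canonical(kmer))
    return kmers


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mash_style_distance(j, k):
    """Mash-style distance from a Jaccard index: -(1/k) * ln(2j / (1 + j))."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))
```

A call such as `mash_style_distance(jaccard(kmer_set(reads_a, 21), kmer_set(reads_b, 21)), 21)` would give a naive distance between two read sets, where `reads_a`, `reads_b`, and `k = 21` are placeholder choices. At low coverage many k-mers of each genome go unsampled, so this naive value is biased unless coverage and sequencing error are modeled.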
