eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Scalable Statistical Methods for Ancestral Inference from Genomic Variation Data

Abstract

Developments in DNA sequencing technology over the last few years have yielded unprecedented volumes of genetic data. The resulting datasets are indispensable for a variety of purposes, from understanding cancer to answering questions about evolution. Although these large quantities of data are now easy to obtain, extracting meaning from them remains an open and challenging problem. In this work, we develop statistical methods to infer population genetic parameters from high-throughput sequencing data through the use of coalescent theory, which stochastically models the evolution of DNA from generation to generation. Because closed-form analytic formulas are unknown for many parameters of interest, computational methods such as Markov chain Monte Carlo (MCMC) and sequential importance sampling become particularly relevant.

We develop a method using reversible jump MCMC to infer genome-wide variable recombination rates and apply it to data from two Drosophila melanogaster populations. Our analysis of the results reveals several interesting findings. A systematic search for hotspot regions finds only a few along the genome, far fewer than observed in humans. We apply a wavelet analysis to quantify the differences between the recombination maps of the two populations and find that, although there is high variability at fine scales, the maps generally agree at broad scales. We also use the wavelet analysis to assess correlations among various genomic features and find, in contrast to humans, a correlation between recombination and diversity.
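The idea of comparing two maps scale by scale can be illustrated with a minimal sketch. It assumes a plain Haar wavelet transform and a Pearson correlation of the detail coefficients at each scale; the abstract does not say which wavelet or statistic the dissertation uses, so the function names (`haar_details`, `scalewise_correlation`) and the synthetic maps in the demo are hypothetical.

```python
import math
import random


def haar_details(signal):
    """Return Haar detail coefficients per scale, finest scale first.

    Assumes the signal length is a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        # Detail = scaled difference of neighbours; approximation = scaled sum.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details


def pearson(x, y):
    """Pearson correlation of two equal-length lists (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def scalewise_correlation(map_a, map_b, min_coeffs=4):
    """Correlate the detail coefficients of two maps at each scale.

    Returns (scale_index, correlation) pairs, where scale 0 is the finest.
    Scales with fewer than min_coeffs coefficients are skipped because a
    correlation over so few points is not informative.
    """
    da, db = haar_details(map_a), haar_details(map_b)
    return [
        (scale, pearson(x, y))
        for scale, (x, y) in enumerate(zip(da, db))
        if len(x) >= min_coeffs
    ]


if __name__ == "__main__":
    # Hypothetical synthetic maps: a shared broad-scale trend plus
    # independent fine-scale noise in each population.
    rng = random.Random(1)
    trend = [math.sin(2 * math.pi * i / 256) for i in range(256)]
    map_a = [t + rng.gauss(0, 0.3) for t in trend]
    map_b = [t + rng.gauss(0, 0.3) for t in trend]
    for scale, corr in scalewise_correlation(map_a, map_b):
        print(f"scale {scale}: correlation {corr:.2f}")
```

In a toy setup like this one, the shared trend only influences the coarse-scale coefficients, which is the kind of pattern a scale-wise comparison is meant to expose. A real analysis would more likely rely on an established wavelet library and a formal significance test.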

In addition, we describe a particle filtering method to sample genealogies from the posterior distribution. Particle filtering is a model estimation technique in the family of sequential importance sampling methods. It makes inference possible on a continuous state space when the distributions under consideration are too complex for exact inference to be tractable. The sequentially Markov coalescent, an approximation to the coalescent model in which the Markov property is imposed along the sequence, is used to decompose the likelihood of the data into a product of conditional densities, which allows inference on otherwise intractably long sequences of genomic data.
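As a rough illustration of the technique, the sketch below runs a bootstrap particle filter on a toy one-dimensional linear-Gaussian state space model. This is an assumption made purely for illustration: the hidden state here is a real number following an AR(1) process, not a genealogy, and none of the parameter names (`phi`, `process_sd`, `obs_sd`) come from the dissertation. What carries over is the structure: particles are propagated along the sequence under a Markov transition, reweighted by each observation, and resampled, while the log-likelihood accumulates as a sum of log conditional densities.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def bootstrap_particle_filter(observations, n_particles=500, phi=0.9,
                              process_sd=1.0, obs_sd=1.0, rng=None):
    """Bootstrap particle filter for a toy AR(1)-plus-noise model.

    Hypothetical model (requires |phi| < 1):
        x_t = phi * x_{t-1} + N(0, process_sd^2)
        y_t = x_t + N(0, obs_sd^2)

    Returns (filtered_means, log_likelihood_estimate).
    """
    rng = rng or random.Random(0)
    # Start from the stationary distribution of the hidden AR(1) process.
    init_sd = process_sd / math.sqrt(1 - phi * phi)
    particles = [rng.gauss(0, init_sd) for _ in range(n_particles)]
    log_lik = 0.0
    filtered_means = []
    for t, y in enumerate(observations):
        if t > 0:
            # Propagate each particle through the Markov transition.
            particles = [phi * x + rng.gauss(0, process_sd) for x in particles]
        # Weight each particle by how well it explains the observation.
        log_w = [normal_logpdf(y, x, obs_sd) for x in particles]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]  # subtract max for stability
        total = sum(w)
        # The mean weight estimates the conditional density p(y_t | y_1..t-1);
        # the log-likelihood is the sum of these log conditional densities.
        log_lik += m + math.log(total / n_particles)
        norm_w = [wi / total for wi in w]
        filtered_means.append(sum(wi * x for wi, x in zip(norm_w, particles)))
        # Multinomial resampling concentrates particles on plausible states.
        particles = rng.choices(particles, weights=norm_w, k=n_particles)
    return filtered_means, log_lik


if __name__ == "__main__":
    # Simulate a short sequence from the same toy model, then filter it.
    sim = random.Random(42)
    x, ys = 0.0, []
    for _ in range(50):
        x = 0.9 * x + sim.gauss(0, 1.0)
        ys.append(x + sim.gauss(0, 1.0))
    means, ll = bootstrap_particle_filter(ys)
    print(f"log-likelihood estimate: {ll:.2f}")
```

Multinomial resampling at every step is the simplest choice and is used here only for brevity; lower-variance schemes such as systematic resampling, or resampling only when the effective sample size drops, are common alternatives.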
