Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data
 Author(s): Bhaskar, Anand
 Advisor(s): Song, Yun S, et al.
Abstract
The recent availability of large-sample high-throughput sequencing data has given us an unprecedented opportunity to very finely resolve the details of historical demographic processes that have shaped the genomes of modern human populations. Such understanding of population demography is important for several applications: to avoid false positives in genome-wide association studies; to calibrate null models of neutral genome evolution in order to find regions under selection; to study the impact of bottlenecks and small founder populations on genetic mutational load; to reconstruct large-scale historical human migration and admixture events; and so on.
In this dissertation, we consider some statistical, algorithmic and robustness aspects of demographic inference from genomic variation data. In particular, we study the problem of determining the historical effective size of a population from the sample frequency spectrum (SFS), which measures the distribution of allele frequencies in a sample of sequences drawn from the population.
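As an illustration of the quantity being studied, the following minimal sketch computes an SFS from a small sample. It assumes the data are given as a list of haplotypes coded 0 (ancestral) and 1 (derived); the function name `sample_frequency_spectrum` and the toy data are hypothetical and not taken from the dissertation.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Return [xi_1, ..., xi_{n-1}] for a sample of n haplotypes.

    xi_i is the number of sites at which the derived allele (coded 1)
    appears in exactly i of the n sequences. Sites where the derived
    allele is absent (count 0) or fixed (count n) are not included.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    # Derived-allele count at each site, tallied across all sites.
    counts = Counter(
        sum(h[site] for h in haplotypes) for site in range(num_sites)
    )
    return [counts.get(i, 0) for i in range(1, n)]


# Toy sample: 4 sequences (rows) observed at 5 sites (columns).
haplotypes = [
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
]
sfs = sample_frequency_spectrum(haplotypes)
print(sfs)  # [2, 1, 1]
```

Here two sites carry the derived allele in one sequence, one site in two sequences, and one site in three; the fifth site is monomorphic and does not contribute.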
From the statistical or information-theoretic perspective, it is known that this inverse problem does not have a unique solution in general, no matter how large the sample size. For any population allele frequency distribution, there exist infinitely many population size functions that are consistent with this distribution. While such a non-identifiability result might appear to pose a serious problem to statistical inference algorithms, we show that the situation is not so bad in practice. In particular, we prove that if the true population size function is piecewise-defined with each piece belonging to some family of biologically motivated functions, then the SFS of a finite sample of sequences uniquely determines the underlying demography. We obtain a general bound on the sample size sufficient for identifiability; this bound depends on the number of pieces in the demographic model and on the family of functions for each piece. We also give concrete instantiations of this bound for piecewise-constant and piecewise-exponential models that are commonly used in demographic inference analyses.
From the algorithmic perspective, we build on analytic results for the expected SFS of a time-varying population size function and develop an efficient likelihood-based algorithm to infer piecewise-exponential population size histories from large-sample allele frequency data. By considering very large samples, our method can resolve details of the population history from the very recent past that are not otherwise accessible using smaller samples.
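To make the idea of a likelihood computed from the SFS concrete, here is a minimal sketch for the simplest case only: a constant population size, where the expected number of sites with i derived copies is the standard result θ/i. It is not the dissertation's algorithm, which handles piecewise-exponential histories; the function names, the multinomial form of the likelihood, and the example values are assumptions made for illustration.

```python
import math


def expected_sfs_constant(n, theta=1.0):
    """Expected SFS for a constant-size population: E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n)]


def multinomial_log_likelihood(observed, expected):
    """Log-likelihood of an observed SFS, up to an additive constant.

    Each segregating site is treated as an independent draw whose
    probability of falling in class i is expected[i] / sum(expected).
    The multinomial coefficient is omitted because it does not depend
    on the demographic model.
    """
    total = sum(expected)
    return sum(
        o * math.log(e / total)
        for o, e in zip(observed, expected)
        if o > 0
    )


observed = [2, 1, 1]                        # toy SFS for n = 4 sequences
expected = expected_sfs_constant(n=4, theta=2.0)
log_lik = multinomial_log_likelihood(observed, expected)
```

In an inference setting, `expected` would instead be computed from a candidate population size history, and the history would be chosen to maximize this log-likelihood.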
The third aspect of this dissertation is concerned with understanding the robustness of widely used evolutionary models to violations of model assumptions. Continuous-time evolutionary models like Kingman's coalescent and its dual diffusion process are derived from discrete models of random mating by assuming that the sample size being analyzed is much smaller than the population size. However, the very large sample datasets being produced due to advances in high-throughput sequencing technologies are approaching the limits of this assumption. To investigate this issue, we develop exact algorithms for computation under the discrete-time Wright-Fisher model and use these algorithms to study the distortions in several genealogical quantities arising due to the coalescent approximation. Our findings indicate that for several demographic models inferred from large-scale sequence data, there can be substantial genealogical deviations introduced by the coalescent approximation that might influence the results of inference studies.
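A small numerical sketch shows why the sample size matters for this approximation. It compares the probability that n sampled lineages have n distinct parents one generation back under a haploid Wright-Fisher model with N individuals against the corresponding probability under Kingman's coalescent. The function names and the values of n and N are illustrative assumptions, and this single quantity is far simpler than the genealogical quantities analyzed in the dissertation.

```python
import math


def wf_no_coalescence(n, N):
    """Exact one-generation probability under the Wright-Fisher model.

    Each of the n lineages picks a parent uniformly at random among N,
    so all parents are distinct with probability prod_{i=1}^{n-1} (1 - i/N).
    """
    p = 1.0
    for i in range(1, n):
        p *= 1.0 - i / N
    return p


def coalescent_no_coalescence(n, N):
    """Coalescent approximation to the same probability.

    With time measured in units of N generations, n lineages coalesce
    at rate n(n-1)/2, so no event occurs in one generation (time 1/N)
    with probability exp(-n(n-1) / (2N)).
    """
    return math.exp(-n * (n - 1) / (2 * N))


N = 10_000
for n in (10, 1_000):
    ratio = wf_no_coalescence(n, N) / coalescent_no_coalescence(n, N)
    print(f"n={n}: Wright-Fisher / coalescent = {ratio:.4f}")
```

For n much smaller than N the ratio is very close to 1, while for n = 1,000 and N = 10,000 the two probabilities differ by a large factor, which illustrates how the approximation degrades as the sample size approaches the population size.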