Inferring Selection From Limited Genetic Time-Series Data
- Li, Yunxiao
- Advisor(s): Barton, John P
Abstract
Genetic data collected over time provide an exciting opportunity to study natural selection. The study of selection in an evolving population is complicated by genetic linkage (i.e., the correlation between alleles at different locations on the genome due to shared inheritance), which entangles selection with other effects such as genetic hitchhiking, where neutral alleles rise to high frequencies together with their beneficial backgrounds, or clonal interference, where subpopulations with different beneficial alleles compete for dominance. It is thus important to account for genetic linkage to accurately measure selection in such studies. A statistical inference method, Marginal Path Likelihood (MPL), accounts for genetic linkage by modeling evolution with a Fokker–Planck equation, which, by applying standard methods from statistical physics, can be converted into a path integral that quantifies the probability of paths of mutation frequencies. The MPL method then obtains the maximum a posteriori estimate of selection strength by inverting the path integral expression. However, such inference requires a direct measure of linkage, which most high-throughput sequencing methods cannot provide because of their short read lengths. These methods typically yield only allele frequencies, often sampled sparsely in time. This thesis introduces three new methods for augmenting time-series allele frequency data with additional information that can improve selection inference. Chapter 2 introduces a simple, generic method that estimates time-varying linkage information from time-series allele frequencies. This method enables the use of linkage-aware inference methods even for data sets where only allele frequency time series are available. Chapter 3 introduces a method that infers clonal structure from time-series allele frequencies.
This method targets data from evolution with prominent clonal interference, and improves selection inference by recovering clonal structure, which provides accurate covariance information. Chapter 4 introduces a computational method that recovers realistic dynamics within the sampling intervals of time-series allele frequency data. This method targets data that have a stable clonal structure but are sampled sparsely in time. By interpolating allele frequency and covariance trajectories to the finest temporal resolution, it further improves selection inference even when the allele frequencies are sparsely sampled in time. The three methods all aim to extract as much information as possible from limited genetic time-series data. As they make use of more assumptions, they become more specialized to particular types of data sets, and are thereby able to mitigate the effects of specific limitations in the data and to preserve or improve the performance of selection inference.
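As a rough illustration of why linkage-aware inference needs covariance information and not only allele frequencies, the sketch below solves a linear system of the kind usually associated with MPL-style estimators: the net frequency change, corrected for mutational flux, is divided by the time-integrated allele covariance matrix plus a regularization term. The closed form, the function name `mpl_style_estimate`, and the parameters `mu` (mutation rate) and `gamma` (regularization strength) are assumptions made for illustration; none of them is specified in this abstract, and the thesis itself should be consulted for the exact estimator.

```python
import numpy as np


def mpl_style_estimate(x, cov, dt, mu=0.0, gamma=1.0):
    """Hypothetical sketch of an MPL-style selection estimate.

    x     : (T+1, L) allele frequencies at T+1 time points for L loci
    cov   : (T+1, L, L) allele covariance matrices at the same time points
    dt    : (T,) number of generations between consecutive time points
    mu    : per-site mutation rate (assumed symmetric)
    gamma : regularization strength from a Gaussian prior (assumed)
    """
    x = np.asarray(x, dtype=float)
    cov = np.asarray(cov, dtype=float)
    dt = np.asarray(dt, dtype=float)
    n_loci = x.shape[1]

    # Time-integrated covariance, approximated by a left Riemann sum.
    int_cov = np.einsum("t,tij->ij", dt, cov[:-1])

    # Time-integrated mutational flux toward frequency 1/2.
    int_mut = mu * np.einsum("t,ti->i", dt, 1.0 - 2.0 * x[:-1])

    # Net change in allele frequencies over the whole trajectory.
    dx = x[-1] - x[0]

    # Solve (int_cov + gamma * I) s = dx - int_mut for s.
    return np.linalg.solve(int_cov + gamma * np.eye(n_loci), dx - int_mut)


# Two fully linked loci: only the first is "truly" rising, the second hitchhikes.
x = np.array([[0.5, 0.5], [0.6, 0.6]])
full_linkage = np.array([[[0.25, 0.25], [0.25, 0.25]]] * 2)
no_linkage = np.array([[[0.25, 0.0], [0.0, 0.25]]] * 2)
s_linked = mpl_style_estimate(x, full_linkage, dt=[1.0], gamma=0.1)
s_unlinked = mpl_style_estimate(x, no_linkage, dt=[1.0], gamma=0.1)
```

In this toy input, ignoring the off-diagonal covariance assigns each locus the full frequency change, whereas including it splits the same change between the two linked loci. The off-diagonal terms are exactly the quantities that short-read data do not measure directly.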