Identifying and Characterizing the Genomic Signatures of Natural Selection
- Author(s): Ronen, Roy;
- et al.
Despite being founded in the early 1920s, the field of population and evolutionary genetics is currently in its second life. This is primarily driven by the recent influx of data from genomic studies of ever-increasing size. The sheer amount and complexity of data produced by these studies also creates a need for improved computational techniques for analysis and inference. In this thesis, I present three computational methods aimed at improving our understanding of genetic variation in natural populations. First, I present an algorithm for improving the accuracy of genome assemblies using the positional de Bruijn graph. I show that, using the original sequence reads in conjunction with a novel data structure, I can significantly improve the accuracy of assembled draft genomes. This in turn improves the accuracy of all downstream inferences that use the draft as a reference, including gene discovery, transcript expression, variant calling, and many others. Second, I describe a computational framework that uses supervised learning of mutation frequency profiles to identify genomic regions impacted by positive natural selection. Identifying such regions makes it possible to pinpoint and understand the mechanisms responsible for adaptive traits, such as lactose tolerance in northern European populations, hypoxia tolerance in high-altitude populations, and malaria resistance in African populations. Extending the widely used theoretical framework of the site frequency spectrum (SFS), I show that higher power to detect selection is achieved by training parameter-specific models of the SFS. I further show that these models can be generalized, allowing their use without prior knowledge of the selection parameters. Last, I describe a new statistic that naturally captures many of the properties shared by haplotypes carrying an adaptive allele.
I provide a theoretical model for the behavior of the statistic under different demographic and evolutionary scenarios, and validate the model using simulated data. Using this framework, I develop an algorithm that, given a region known to be under positive selection, predicts carriers of the adaptive mutation without knowing its position. I demonstrate its high accuracy on simulated data, as well as on genetic data from well-known instances of positive selection in human populations.
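For readers unfamiliar with the site frequency spectrum mentioned above, the following is a minimal sketch of how an unfolded SFS can be tabulated from a sample of haplotypes. It is an illustration only, not the method developed in the thesis: the function name `site_frequency_spectrum`, the 0/1 encoding (0 = ancestral allele, 1 = derived allele), and the toy data are all assumptions made for this example.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Return the unfolded SFS of a sample of haplotypes.

    `haplotypes` is a list of n equal-length lists of 0/1 values
    (assumed encoding: 0 = ancestral, 1 = derived).
    Entry i-1 of the result is the number of sites at which the derived
    allele is carried by exactly i of the n haplotypes, for i = 1..n-1.
    Sites where the derived allele is absent or fixed are not counted.
    """
    n = len(haplotypes)
    if n == 0:
        return []
    num_sites = len(haplotypes[0])
    # Derived-allele count at each site (column sums of the haplotype matrix).
    derived_counts = Counter(
        sum(hap[site] for hap in haplotypes) for site in range(num_sites)
    )
    return [derived_counts.get(i, 0) for i in range(1, n)]


# Toy sample: 4 haplotypes (rows) by 5 sites (columns).
example = [
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
print(site_frequency_spectrum(example))  # [2, 0, 2]
```

In the toy sample, two sites carry the derived allele on a single haplotype and two carry it on three haplotypes, while the fourth site has no derived alleles and is excluded. A selective sweep is generally expected to distort this spectrum relative to neutral expectations, which is what makes the SFS a natural feature vector for supervised approaches of the kind described in the abstract.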