Efficient Probabilistic Model Based Approaches for Analysis of Human Genomic Data
- Author(s): Yang, Wenyun
- Advisor(s): Eskin, Eleazar
- et al.
The advent of genotyping and sequencing technologies has enabled human genetics to discover numerous genetic variants and to perform analyses at the level of populations. Understanding the genetic diversity of populations has broad applications in studies of human disease, history, and the relationships within and among populations. I propose a new approach, spatial ancestry analysis, for modeling genotypes in two- and three-dimensional space. I show that explicitly modeling allele frequency as a function of location allows us to localize individuals on a geographical map based on their genetic information alone. Furthermore, a direct probabilistic interpretation of the model enables us to accurately predict the geographical origins of an individual even when the individual has mixed ancestry. The analysis also identifies additional genes, e.g., FOXP2, OCA2 and LRP1B, with extreme allele frequency gradients that may be due to selection.
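To make the localization idea concrete, here is a minimal sketch rather than the method as implemented in the thesis. It assumes, for illustration only, that each SNP's allele frequency follows a logistic gradient over map coordinates, that genotypes follow Hardy-Weinberg proportions, and that SNPs are independent; the function names and the grid search are hypothetical choices.

```python
import math

def allele_freq(x, y, a, b, c):
    """Assumed logistic gradient: allele frequency at map position (x, y)."""
    return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))

def log_likelihood(genotypes, snps, x, y):
    """Log-likelihood (up to a position-independent constant) of diploid
    genotypes, coded as 0/1/2 allele counts, for an individual at (x, y).
    Assumes Hardy-Weinberg proportions and independent SNPs."""
    total = 0.0
    for g, (a, b, c) in zip(genotypes, snps):
        f = allele_freq(x, y, a, b, c)
        f = min(max(f, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def localize(genotypes, snps, grid):
    """Return the grid point that maximizes the genotype likelihood."""
    return max(grid, key=lambda p: log_likelihood(genotypes, snps, p[0], p[1]))
```

Given per-SNP gradient coefficients `(a, b, c)` and a list of candidate `(x, y)` positions, `localize` returns the most likely position. A grid search is used here only for simplicity; a continuous optimizer would serve the same purpose.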
I further generalize spatial ancestry analysis by combining a hidden Markov model of admixture with a model of the spatial distribution of variants, so that the locations of an admixed individual's ancestors are inferred jointly with the ancestry assignment at each locus in the genome. For individuals of mixed European ancestry, this generalized approach localizes recent ancestors to within an average of 470 km of the reported locations of their grandparents.
I propose a novel framework for haplotype inference from short read sequencing that leverages multi-SNP reads together with a reference panel of haplotypes. The basis of the approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. An efficient sampling method within this probabilistic model achieves performance superior to that of existing methods. Using sequencing reads simulated from real individual genotypes in the HapMap and 1000 Genomes Project data, I show that the method is highly accurate and computationally efficient.
Finally, I introduce a novel spatial-aware haplotype copying model, which assumes that any chromosome can be modeled as a mosaic of segments copied from a set of sampled chromosomes, but that chromosomes closest on the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. This model has various potential applications. In particular, I show that it achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copying model. I also show the utility of this model in selecting a small personalized reference panel for imputation, which leads to both improved accuracy and lower computational runtime than the standard approach.
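The spatial-aware copying idea can be illustrated with a short sketch under stated assumptions; it is not the thesis implementation. Here the prior probability of copying from a reference haplotype is assumed to decay exponentially with its distance from the target on the map, and the likelihood is computed with a Li-and-Stephens-style forward pass. The decay form and the `scale`, `switch` and `mismatch` parameters are hypothetical.

```python
import math

def spatial_copy_prior(target_pos, ref_positions, scale=1.0):
    """Prior copying weights that favor nearby reference haplotypes.
    The exponential decay with distance is an illustrative assumption."""
    weights = [math.exp(-math.dist(target_pos, p) / scale) for p in ref_positions]
    total = sum(weights)
    return [w / total for w in weights]

def forward_likelihood(target, refs, prior, switch=0.01, mismatch=0.01):
    """Likelihood of `target` as a mosaic of the haplotypes in `refs`.
    When the copying template switches, the new one is drawn from `prior`."""
    def emit(k, j):
        # Probability of the observed allele given the copied template.
        return 1.0 - mismatch if refs[k][j] == target[j] else mismatch

    n = len(refs)
    alpha = [prior[k] * emit(k, 0) for k in range(n)]
    for j in range(1, len(target)):
        total = sum(alpha)
        alpha = [((1.0 - switch) * alpha[k] + switch * prior[k] * total) * emit(k, j)
                 for k in range(n)]
    return sum(alpha)
```

Passing a uniform `prior` recovers a spatial-unaware copying model, so the two variants can be compared on the same data. Restricting `refs` to the haplotypes with the largest prior weights is one simple way to form a small personalized reference panel.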