Computational Genetic Approaches for Understanding the Genetic Basis of Complex Traits
- Author(s): Kang, Eun Yong
- Advisor(s): Eskin, Eleazar
- et al.
Recent advances in genotyping and sequencing technology have enabled researchers to collect an enormous amount of high-dimensional genotype data. These large scale genomic data provide unprecedented opportunity for researchers to study and analyze the genetic factors of human complex traits. One of the major challenges in analyzing these high-throughput genomic data is requirements for effective and efficient computational methodologies. In this thesis, I introduce several methodologies for analyzing these genomic data which facilitates our understanding of the genetic basis of complex human traits. First, I introduce a method for inferring biological networks from high-throughput data containing both genetic variation information and gene expression profiles from genetically distinct strains of an organism. For this problem, I use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. In particular, I utilize prior biological knowledge that genetic variations affect gene expressions, but not vice versa, which allow us to direct the subsequent edges between two gene expression levels. The prediction of a presence of causal relationship as well as the absence of causal relationship between gene expressions can facilitate distinguishing between direct and indirect effects of variation on gene expression levels. I demonstrate the utility of our approach by applying it to data set containing 112 yeast strains and the proposed method identifies the known "regulatory hotspot" in yeast. Second, I introduce efficient pairwise identity by descent (IBD) association mapping method, which utilizes importance sampling to improve efficiency and enables approximation of extremely small p-values. Two individuals are IBD at a locus if they have identical alleles inherited from a common ancestor. One popular approach to find the association between IBD status and disease phenotype is the pairwise method where one compares the IBD rate of case/case pairs to the background IBD rate to detect excessive IBD sharing between cases. One challenge of the pairwise method is computational efficiency. In the pairwise method, one uses permutation to approximate p-values because it is difficult to analytically obtain the asymptotic distribution of the statistic. Since the p-value threshold for genome-wide association studies (GWAS) is necessarily low due to multiple testing, one must perform a large number of permutations which can be computationally demanding. I present Fast-Pairwise to overcome the computational challenges of the traditional pairwise method by utilizing importance sampling to improve efficiency and enable approximation of extremely small p-values. Using the WTCCC type 1 diabetes data, I show that Fast-Pairwise can successfully pinpoint a gene known to be associated to the disease within the MHC region. Finally, I introduce a novel meta analytic approach to identify gene-by-environment interactions by aggregating the multiple studies with varying environmental conditions. Identifying environmentally specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits but, under varying environmental conditions. These studies when examined in aggregate provide an opportunity to identify genomic loci exhibiting environmentally-dependent effects. In this project, I jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. I apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. We identify 26 significant loci involved in High-density lipoprotein (HDL) cholesterol, many of which show significant evidence of involvement in gene-by-environment interactions.