eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Scalable and Robust Statistical Inference Algorithms for Linking Genotypes to Phenotypes

Abstract

Advances in DNA sequencing technology and falling sequencing costs have driven exponential growth in the amount of genomic data generated. This growth provides an unprecedented opportunity to access the genotypes of a large population, including millions of genetic variants, and to collect hundreds of thousands of phenotypic measurements from the same individuals. It makes possible the systematic study of the genetic architecture underlying complex traits and diseases. Genetic architecture refers broadly to the full set of genetic contributions to a given trait, together with the characteristics of those contributions.

During the past decade, variance components analysis has emerged as a robust statistical framework for investigating the genetic architectures of complex traits. Gaining accurate and novel insights into genetic architecture requires applying flexible variance component models to large-scale datasets, and fitting such models in turn requires scalable algorithms. Common approaches for estimating variance components search for parameter values that maximize the likelihood or the restricted maximum likelihood (REML). Despite several algorithmic advancements, computing REML estimates of variance components remains challenging on extensive datasets like the UK Biobank, which consists of approximately 500,000 genotyped individuals, millions of single nucleotide polymorphisms (SNPs), and hundreds of thousands of phenotypes. This thesis introduces a set of scalable and robust statistical inference algorithms rooted in variance components analysis. These algorithms are designed to estimate the variation in a trait that can be explained by linear and non-linear functions of the genotype, such as the interaction between alleles at a single genetic variant (dominance), the interaction between genetic variants (epistasis), and the interaction between environmental factors and genetic variants (GxE). The algorithms also aim to estimate the distribution of these effects across the genome.
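To make the variance components setting concrete, the following is a minimal sketch of a method-of-moments estimator (Haseman-Elston regression), a well-known alternative to REML for a single additive component. It is offered as an illustration only and is not presented as the algorithm developed in the thesis. The function name `mom_variance_components`, the simulated genotypes, and all parameter values are assumptions made for this example.

```python
import numpy as np

def mom_variance_components(X, y):
    """Method-of-moments estimate of (sigma_g^2, sigma_e^2).

    Assumes the model Cov(y) = sigma_g^2 * K + sigma_e^2 * I, where
    X is an N x M matrix of standardized genotypes, K = X X^T / M,
    and y is a centered phenotype vector of length N.
    """
    n, m = X.shape
    K = X @ X.T / m  # genetic relatedness matrix (N x N)
    # Normal equations obtained by matching y y^T to sigma_g^2 K + sigma_e^2 I
    # in the least-squares (Frobenius norm) sense.
    A = np.array([[np.sum(K * K), np.trace(K)],
                  [np.trace(K), float(n)]])
    b = np.array([y @ K @ y, y @ y])
    sigma_g2, sigma_e2 = np.linalg.solve(A, b)
    return sigma_g2, sigma_e2

# Hypothetical simulated data: 1000 individuals, 500 SNPs, true heritability 0.5.
rng = np.random.default_rng(0)
n, m, h2 = 1000, 500, 0.5
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)  # allele counts in {0, 1, 2}
X = (G - G.mean(axis=0)) / G.std(axis=0)             # standardize each SNP
beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)      # additive effect sizes
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
y = y - y.mean()

sigma_g2, sigma_e2 = mom_variance_components(X, y)
h2_hat = sigma_g2 / (sigma_g2 + sigma_e2)  # estimated SNP heritability
```

This sketch forms the full N x N matrix K, which is feasible only for small N. At biobank scale that step is the bottleneck, which is why scalable approximations are needed.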

By applying our methods to the UK Biobank dataset, we uncover valuable insights into the genetic architecture of complex traits. Notable observations are as follows. First, we observe that both per-allele squared additive and GxE effect sizes increase with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD). Second, testing whether GxE heritability is enriched around genes that are highly expressed in specific tissues, we find significant tissue-specific enrichments that include brain-specific enrichment for BMI and basal metabolic rate in the context of smoking, adipose-specific enrichment for waist-to-hip ratio (WHR) in the context of sex, and cardiovascular tissue-specific enrichment for total cholesterol in the context of age. Third, we detect epistatic effects between SNPs located on the same chromosome and between SNPs located on different chromosomes. Fourth, our analyses indicate a limited contribution of dominance heritability to complex trait variation.
