Scholarly Works (12 results)

Thesis
Peer Reviewed

Design of efficient and accurate statistical approaches to correct for confounding effects and identify true signals in genetic association studies

JOO, JONG WHA JOANNE
Advisor(s): Eskin, Eleazar

UCLA Electronic Theses and Dissertations (2015)

Over the past decades, genome-wide association studies have dramatically improved, especially with the advent of high-throughput technologies such as microarrays and next-generation sequencing. Although genome-wide association studies have been extremely successful in identifying tens of thousands of variants associated with various diseases or traits, many studies have reported that some of the associations are spurious, induced by various confounding factors such as population structure or technical artifacts. In this dissertation, I focus on effectively and accurately identifying true signals in genome-wide association studies in the presence of confounding effects. First, I introduce a method that effectively identifies regulatory hotspots while correcting for false signals induced by technical confounding effects in expression quantitative trait loci studies. Technical confounding factors such as batch effects complicate expression quantitative trait loci analysis by inducing heterogeneity in gene expression. This creates correlations between the samples and may cause spurious associations, leading to spurious regulatory hotspots. By formulating the problem of identifying genetic signals in a linear mixed model framework, I show how we can identify regulatory hotspots while capturing heterogeneity in expression quantitative trait loci studies. Second, I introduce an efficient and accurate multiple-phenotype analysis method for high-dimensional data in the presence of population structure. Recently, large amounts of genomic data such as expression data have been collected from genome-wide association study cohorts, and in many cases it is preferable to analyze thousands of phenotypes simultaneously rather than one at a time. However, when confounding factors such as population structure exist in the data, even a small bias induced by the confounding effects accumulates across phenotypes and may cause serious problems in multiple-phenotype analysis. By incorporating the linear mixed model into the statistics of multivariate regression, I show that we can dramatically increase the accuracy of multiple-phenotype analysis in high-dimensional data. Lastly, I introduce an efficient multiple testing correction method for the linear mixed model. The significance threshold differs as a function of species, marker density, genetic relatedness, and trait heritability. However, none of the previous multiple testing correction methods can comprehensively account for these factors. I show that the significance threshold changes with the dosage of genetic relatedness and introduce a novel multiple testing correction approach that utilizes the linear mixed model to account for the confounding effects in the data.
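
For readers unfamiliar with the linear mixed model framework mentioned in the abstract, the form commonly used in association studies can be written as follows. This is a generic textbook sketch, not the dissertation's exact formulation; the symbols (phenotype vector y, fixed-effect design matrix X, kinship or sample-covariance matrix K, variance components σ_g² and σ_e²) are assumptions introduced here for illustration.

```latex
y = X\beta + u + \varepsilon, \qquad
u \sim \mathcal{N}(0, \sigma_g^2 K), \qquad
\varepsilon \sim \mathcal{N}(0, \sigma_e^2 I)

\operatorname{Var}(y) = \sigma_g^2 K + \sigma_e^2 I
```

Under this sketch, correlation between samples (from population structure or batch effects) is absorbed by the random effect u through K, so it is less likely to be attributed to the tested marker.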

Article
Peer Reviewed

Multiple testing correction in linear mixed models

UCLA Previously Published Works (2016)

Background

Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM.

Results

We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach.

Conclusions

We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data.
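
The permutation test described in the Background as the gold standard can be sketched as below. This is a minimal illustration of the classical approach for unrelated samples, using absolute Pearson correlation as the test statistic; it is not the method proposed in the article and it does not account for genetic relatedness. The function names, parameters, and toy data are hypothetical.

```python
import random


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no association signal
    return cov / (var_x * var_y) ** 0.5


def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Estimate the |r| threshold that controls family-wise error at alpha.

    genotypes: list of markers, each a list of per-individual genotype values.
    phenotype: list of per-individual trait values (not modified).
    """
    rng = random.Random(seed)
    shuffled = list(phenotype)  # copy so the caller's data stays intact
    max_stats = []
    for _ in range(n_perm):
        # Shuffling breaks any genotype-phenotype link but keeps the
        # correlation structure among markers.
        rng.shuffle(shuffled)
        max_stats.append(max(abs(correlation(m, shuffled)) for m in genotypes))
    max_stats.sort()
    # Empirical (1 - alpha) quantile of the null maximum statistic.
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return max_stats[index]


if __name__ == "__main__":
    toy_genotypes = [
        [0, 1, 2, 1, 0, 2, 1, 0],
        [2, 2, 1, 0, 0, 1, 2, 1],
        [1, 0, 0, 2, 1, 1, 0, 2],
    ]
    toy_phenotype = [0.1, 1.2, 2.3, 0.9, 0.2, 2.1, 1.1, 0.0]
    print(permutation_threshold(toy_genotypes, toy_phenotype))
```

Each permutation requires a full scan over all markers, which is why this approach becomes expensive when every scan involves fitting a mixed model over millions of markers.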

Article
Peer Reviewed

Privacy preserving protocol for detecting genetic relatives using rare variants

UCLA Previously Published Works (2014)

Motivation

High-throughput sequencing technologies have impacted many areas of genetic research. One such area is the identification of relatives from genetic data. The standard approach for the identification of genetic relatives collects the genomic data of all individuals and stores it in a database. Then, each pair of individuals is compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing your genetic data with a trusted third party to perform the relatedness test.

Results

In this work, we propose a secure protocol to detect genetic relatives from sequencing data while not exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants, which makes it possible to detect much more distant relationships securely. We use simulated data generated from the 1000 Genomes data and illustrate that we can easily detect up to fifth-degree cousins, which was not possible using existing methods. We also show in the 1000 Genomes data with cryptic relationships that our method can detect these individuals.

Availability

The software is freely available for download at http://genetics.cs.ucla.edu/crypto/.

Article
Peer Reviewed

Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies

UC San Francisco Previously Published Works (2014)

Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.

Article
Peer Reviewed

An Association Mapping Framework To Account for Potential Sex Difference in Genetic Architectures

UCLA Previously Published Works (2018)

Over the past few years, genome-wide association studies have identified many trait-associated loci that have different effects on females and males, which has increased attention to the genetic architecture differences between the sexes. The between-sex differences in genetic architectures can cause a variety of phenomena such as differences in the effect sizes at trait-associated loci, differences in the magnitudes of polygenic background effects, and differences in the phenotypic variances. However, current association testing approaches for dealing with sex, such as including sex as a covariate, cannot fully account for these phenomena and can be suboptimal in statistical power. We present a novel association mapping framework, MetaSex, that can comprehensively account for the genetic architecture differences between the sexes. Through simulations and applications to real data, we show that our framework outperforms previous approaches in association mapping.

Article
Peer Reviewed

Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure

UCLA Previously Published Works (2016)

A typical genome-wide association study tests correlation between a single phenotype and each genotype one at a time. However, single-phenotype analysis might miss unmeasured aspects of complex biological networks. Analyzing many phenotypes simultaneously may increase the power to capture these unmeasured aspects and detect more variants. Several multivariate approaches aim to detect variants related to more than one phenotype, but these current approaches do not consider the effects of population structure. As a result, these approaches may result in a significant amount of false positive identifications. Here, we introduce a new methodology, referred to as GAMMA for generalized analysis of molecular variance for mixed-model analysis, which is capable of simultaneously analyzing many phenotypes and correcting for population structure. In a simulated study using data implanted with true genetic effects, GAMMA accurately identifies these true effects without producing false positives induced by population structure. In simulations with this data, GAMMA is an improvement over other methods which either fail to detect true effects or produce many false positive identifications. We further apply our method to genetic studies of yeast and gut microbiome from mice and show that GAMMA identifies several variants that are likely to have true biological mechanisms.

Article
Peer Reviewed

Identifying genetic relatives without compromising privacy

UCLA Previously Published Works (2014)

The development of high-throughput genomic technologies has impacted many areas of genetic research. While many applications of these technologies focus on the discovery of genes involved in disease from population samples, applications of genomic technologies to an individual's genome or personal genomics have recently gained much interest. One such application is the identification of relatives from genetic data. In this application, genetic information from a set of individuals is collected in a database, and each pair of individuals is compared in order to identify genetic relatives. An inherent issue that arises in the identification of relatives is privacy. In this article, we propose a method for identifying genetic relatives without compromising privacy by taking advantage of novel cryptographic techniques customized for secure and private comparison of genetic information. We demonstrate the utility of these techniques by allowing a pair of individuals to discover whether or not they are related without compromising their genetic information or revealing it to a third party. The idea is that individuals only share enough special-purpose cryptographically protected information with each other to identify whether or not they are relatives, but not enough to expose any information about their genomes. We show in HapMap and 1000 Genomes data that our method can recover first- and second-order genetic relationships and, through simulations, show that our method can identify relationships as distant as third cousins while preserving privacy.

Article
Peer Reviewed

Widespread Allelic Heterogeneity in Complex Traits.

UCLA Previously Published Works (2017)

Recent successes in genome-wide association studies (GWASs) make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH) arising from multiple causal variants at a locus. We developed a computational method to infer the probability of AH and applied it to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%-23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH (R² = 0.85, p = 2.2 × 10^-16), indicating that statistical power prevents identification of AH in other loci. Understanding the extent of AH may guide the development of new methods for fine mapping and association mapping of complex traits.

Article
Peer Reviewed

Genome-Wide Association Study for Age-Related Hearing Loss (AHL) in the Mouse: A Meta-Analysis

UCLA Previously Published Works (2014)

Age-related hearing loss (AHL) is characterized by a symmetric sensorineural hearing loss primarily in high frequencies, and individuals have different levels of susceptibility to AHL. Heritability studies have shown that the sources of this variance are both genetic and environmental, with approximately half of the variance attributable to hereditary factors as reported by Huang and Tang (Eur Arch Otorhinolaryngol 267(8):1179-1191, 2010). Only a limited number of large-scale association studies for AHL have been undertaken in humans to date. An alternate and complementary approach to these human studies is through the use of mouse models. Advantages of mouse models include that the environment can be more carefully controlled, measurements can be replicated in genetically identical animals, and the proportion of the variability explained by genetic variation is increased. Complex traits in mouse strains have been shown to have higher heritability, and genetic loci often have stronger effects on the trait compared to humans. Motivated by these advantages, we have performed the first genome-wide association study of its kind in the mouse by combining several data sets in a meta-analysis to identify loci associated with age-related hearing loss. We identified five genome-wide significant loci (p < 10^-6). One of these loci confirmed a previously identified locus (ahl8) on distal chromosome 11 and greatly narrowed the candidate region. Specifically, the most significant associated SNP is located 450 kb upstream of Fscn2. These data confirm the utility of this approach and provide new high-resolution mapping information about variation within the mouse genome associated with hearing loss.
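
As an illustration of how per-study results can be combined in a meta-analysis, the sketch below uses inverse-variance fixed-effects weighting, one common scheme. The abstract does not say which meta-analysis model the study used, so this should be read as a generic example under that assumption; the function name and the input values are hypothetical.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs with inverse-variance weights.

    Returns the pooled effect, its standard error, the z-score, and a
    two-sided p-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


if __name__ == "__main__":
    # Hypothetical effect estimates for one SNP from two data sets.
    print(fixed_effect_meta([(0.5, 0.1), (0.3, 0.2)]))
```

Studies with smaller standard errors receive larger weights, so the pooled estimate is pulled toward the more precise data sets.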

Article
Peer Reviewed

Colocalization of GWAS and eQTL Signals Detects Target Genes.

UCLA Previously Published Works (2016)

The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.