
Scholarly Works (27 results)

Thesis
Peer Reviewed

Effective design and analysis of genetic association studies

Han, Buhm

UC San Diego Electronic Theses and Dissertations (2009)

Genetic association studies are an effective means of discovering associations between genetic variants and diseases. An association study can be summarized in four stages: design, sample collection, analysis, and follow-up. The design and analysis stages pose many statistical and computational challenges, which are closely tied to the correlation structure of genetic variation in the genome, called linkage disequilibrium (LD). In this dissertation, I address some of these challenges and propose solutions that effectively leverage the information in LD patterns. Multiple hypothesis testing correction is the major challenge in the analysis stage. It is difficult to assess the statistical significance of associations because a large number of correlated tests are performed simultaneously. Previous approaches are either inaccurate or prohibitively inefficient. I propose a novel multiple testing correction method that takes advantage of local LD patterns by using a sliding-window approach. My method is highly accurate and efficient, effectively replacing the current approaches. Estimating the statistical power of a study design is a necessary procedure in the design stage to avoid under- or over-powered studies. Current approaches are either inefficient or too conservative because they ignore the correlation between tests. I propose a method that takes the LD patterns into account to estimate the statistical power of a study design efficiently and accurately. The tag SNP selection problem is a widely known challenge in the design stage. I propose a power-based tag SNP selection algorithm that greedily chooses SNPs to maximize the study power. My method outperforms methods based on correlation alone because it exploits the relation between LD and power by accounting for allele frequencies. In the analysis stage, detecting spurious associations is a challenging problem. I propose a novel method that detects spurious associations at the post-association stage using the LD information. Moreover, I extend this framework to propose a new study scheme that "rescues" associations at markers that are excluded by quality controls. My method is applied to the WTCCC dataset to identify a novel association that was recently replicated.
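For orientation, the correlation-only style of tag SNP selection that the abstract contrasts with its power-based algorithm can be sketched as a greedy covering procedure. This is not the dissertation's method: the function name `greedy_tag_snps`, the pairwise r² matrix input, and the 0.8 threshold are assumptions chosen for illustration.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy correlation-only tagging (illustrative sketch, not the power-based method).

    r2 is a symmetric matrix (list of lists) of pairwise r^2 values between SNPs.
    Repeatedly pick the SNP that covers the most still-untagged SNPs, where a SNP
    covers itself and every SNP whose r^2 with it is at least `threshold`.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []

    def covered_by(i):
        # SNPs among the untagged ones that SNP i would capture.
        return {j for j in untagged if j == i or r2[i][j] >= threshold}

    while untagged:
        # Ties are broken in favor of the lowest SNP index.
        best = max(range(n), key=lambda i: (len(covered_by(i)), -i))
        tags.append(best)
        untagged -= covered_by(best)
    return tags
```

A power-based variant would replace the coverage count with an estimate of study power that also depends on allele frequencies, which this sketch deliberately omits.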


Article
Peer Reviewed

Interpreting Meta-Analyses of Genome-Wide Association Studies

UCLA Previously Published Works (2012)

Meta-analysis is an increasingly popular tool for combining multiple genome-wide association studies in a single analysis to identify associations with small effect sizes. The effect sizes between studies in a meta-analysis may differ and these differences, or heterogeneity, can be caused by many factors. If heterogeneity is observed in the results of a meta-analysis, interpreting the cause of heterogeneity is important because the correct interpretation can lead to a better understanding of the disease and a more effective design of a replication study. However, interpreting heterogeneous results is difficult. The standard approach of examining the association p-values of the studies does not effectively predict whether the effect exists in each study. In this paper, we propose a framework facilitating the interpretation of the results of a meta-analysis. Our framework is based on a new statistic representing the posterior probability that the effect exists in each study, which is estimated utilizing cross-study information. Simulations and application to real data show that our framework can effectively segregate the studies predicted to have an effect, the studies predicted to not have an effect, and the ambiguous studies that are underpowered. In addition to helping interpretation, the new framework also allows us to develop a new association testing procedure that takes into account the existence of the effect.


Article
Peer Reviewed

PLEIO: a method to map and interpret pleiotropic loci with GWAS summary statistics.

UCLA Previously Published Works (2021)

Identifying and interpreting pleiotropic loci is essential to understanding the shared etiology among diseases and complex traits. A common approach to mapping pleiotropic loci is to meta-analyze GWAS summary statistics across multiple traits. However, this strategy does not account for the complex genetic architectures of traits, such as genetic correlations and heritabilities. Furthermore, the interpretation is challenging because phenotypes often have different characteristics and units. We propose PLEIO (Pleiotropic Locus Exploration and Interpretation using Optimal test), a summary-statistic-based framework to map and interpret pleiotropic loci in a joint analysis of multiple diseases and complex traits. Our method maximizes power by systematically accounting for genetic correlations and heritabilities of the traits in the association test. Any set of related phenotypes, binary or quantitative traits with different units, can be combined seamlessly. In addition, our framework offers interpretation and visualization tools to help downstream analyses. Using our method, we combined 18 traits related to cardiovascular disease and identified 13 pleiotropic loci, which showed four different patterns of associations.


Article
Peer Reviewed

IPED: Inheritance Path-based Pedigree Reconstruction Algorithm Using Genotype Data

UCLA Previously Published Works (2013)

The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time-consuming and inaccurate for complicated pedigrees, such as pedigrees with inbreeding. In this work, we propose an efficient algorithm that is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation, backward in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path-based approach implemented with an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations, and therefore, it can reconstruct pedigrees that have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data.

Article
Peer Reviewed

Multiple testing correction in linear mixed models

UCLA Previously Published Works (2016)

Background

Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM.

Results

We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach.

Conclusions

We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data.
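The permutation test that the Background above calls the gold standard can be sketched for the simple case of unrelated individuals. This is a minimal illustration, not the paper's approach for linear mixed models: it ignores relatedness and population structure, and the function names, the absolute-correlation statistic, and the default of 1000 permutations are assumptions.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_adjusted_pvalues(genotypes, phenotype, n_perm=1000, seed=0):
    """Max-statistic permutation correction across all markers.

    genotypes: one list of genotype values per marker (same sample order).
    phenotype: one phenotype value per individual.
    Returns one multiple-testing-adjusted p-value per marker.
    """
    rng = random.Random(seed)
    observed = [abs_correlation(g, phenotype) for g in genotypes]
    shuffled = list(phenotype)
    max_null = []
    for _ in range(n_perm):
        # Shuffling the phenotype breaks genotype-phenotype association
        # while keeping the correlation between markers intact.
        rng.shuffle(shuffled)
        max_null.append(max(abs_correlation(g, shuffled) for g in genotypes))
    return [
        (1 + sum(m >= obs for m in max_null)) / (n_perm + 1)
        for obs in observed
    ]
```

Because every permutation recomputes the statistic at every marker, the cost grows with markers times permutations, which is why this procedure becomes slow at genome-wide scale.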


Article
Peer Reviewed

Incorporating prior information into association studies

UCLA Previously Published Works (2012)

Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies, which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, that its results are interpretable because it reports p-values, and that it is optimal in its use of prior information with regard to statistical power.

Availability

The method presented herein is available at http://masa.cs.ucla.edu.


Article
Peer Reviewed

Rapid and Accurate Multiple Testing Correction and Power Estimation for Millions of Correlated Markers

UCLA Previously Published Works (2009)

With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose SLIDE, an accurate and efficient method for multiple testing correction in genome-wide association studies. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method, SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu.
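The general MVN idea described above, sampling null statistics from a multivariate normal whose covariance is the marker correlation matrix, can be sketched as follows. This is not SLIDE: it samples from the full correlation matrix of a small region, with no sliding window and no correction of the tail distribution. The function names, the sample count, and the assumption of a positive-definite correlation matrix are all illustrative.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mvn_corrected_pvalue(observed_abs_z, corr, n_samples=20000, seed=0):
    """Monte Carlo multiple-testing-corrected p-value for one observed |z|.

    corr is the correlation matrix between marker statistics (assumed
    positive definite). The corrected p-value is the estimated probability
    that the largest |z| over all markers reaches observed_abs_z under the null.
    """
    rng = random.Random(seed)
    lower = cholesky(corr)
    n = len(corr)
    exceed = 0
    for _ in range(n_samples):
        independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Multiplying by the Cholesky factor induces the target correlation.
        z = [
            sum(lower[i][k] * independent[k] for k in range(i + 1))
            for i in range(n)
        ]
        if max(abs(v) for v in z) >= observed_abs_z:
            exceed += 1
    return (exceed + 1) / (n_samples + 1)
```

With strongly correlated markers the corrected p-value is smaller than it would be for the same number of independent markers, which is the effect that a correction ignoring correlation misses.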


Article
Peer Reviewed

Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies

UC San Francisco Previously Published Works (2014)

Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Using simulated and real datasets, we show that our method achieves greater sensitivity than previous methods while retaining low false discovery rates.


Article
Peer Reviewed

ForestPMPlot: A Flexible Tool for Visualizing Heterogeneity Between Studies in Meta-analysis

UCLA Previously Published Works (2016)

Meta-analysis has become a popular tool for genetic association studies to combine different genetic studies. A key challenge in meta-analysis is heterogeneity, or the differences in effect sizes between studies. Heterogeneity complicates the interpretation of meta-analyses. In this paper, we describe ForestPMPlot, a flexible visualization tool for analyzing studies included in a meta-analysis. The main feature of the tool is visualizing the differences in the effect sizes of the studies to understand why the studies exhibit heterogeneity for a particular phenotype and locus pair under different conditions. We show the application of this tool to interpret a meta-analysis of 17 mouse studies, and to interpret a multi-tissue eQTL study.
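Heterogeneity in effect sizes, as described above, is commonly quantified with Cochran's Q and the I² statistic around a fixed-effects (inverse-variance) estimate. The sketch below shows those standard formulas only; it is not part of ForestPMPlot, and the function name and input layout are assumptions.

```python
def fixed_effects_summary(betas, std_errs):
    """Inverse-variance fixed-effects meta-analysis with heterogeneity statistics.

    betas: per-study effect size estimates.
    std_errs: per-study standard errors (all positive).
    Returns (pooled effect, pooled standard error, Cochran's Q, I^2).
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    # Cochran's Q: weighted squared deviations of study effects from the pooled effect.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    degrees_of_freedom = len(betas) - 1
    # I^2: share of the variation attributed to heterogeneity, floored at zero.
    i_squared = max(0.0, (q - degrees_of_freedom) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared
```

For example, two studies with effects 0.0 and 0.5 and standard errors of 0.1 each give a pooled effect of about 0.25, Q of about 12.5, and I² of about 0.92, signaling strong heterogeneity.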


Article
Peer Reviewed

Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches

UCLA Previously Published Works (2013)

Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of meta-analysis which aggregates results of multiple genome-wide association studies to improve the power to detect associations. In principle, meta-analysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular meta-analysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.
