Design of efficient and accurate statistical approaches to correct for confounding effects and identify true signals in genetic association studies
Over the past decades, genome-wide association studies have dramatically improved especially with the advent of the hight-throughput technologies such as microarray and next generation sequencing. Although genome-wide association studies have been extremely successful in identifying tens of thousands of variants associated with various disease or traits, many studies have reported that some of the associations are spurious induced by various confounding factors such as population structure or technical artifacts. In this dissertation, I focus on effectively and accurately identifying true signals in genome-wide association studies in the presence of confounding effects. First, I introduce a method that effectively identifying regulatory hotspots while correcting for false signals induced by technical confounding effects in expression quantitative loci studies. Technical confounding factors such as a batch effect complicates the expression quantitative loci analysis by inducing heterogeneity in gene expressions. This creates correlations between the samples and may cause spurious associations leading to spurious regulatory hotspots. By formulating the problem of identifying genetic signals in a linear mixed model framework, I show how we can identify regulatory hotspots while capturing heterogeneity in expression quantitative loci studies. Second, I introduce an efficient and accurate multiple-phenotype analysis method for high-dimensional data in the presence of population structure. Recently, large amounts of genomic data such as expression data have been collected from genome-wide association studies cohorts and in many cases it is preferable to analyze more than thousands of phenotypes simultaneously than analyze each phenotype one at a time. However, when confounding factors, such as population structure, exit in the data, even a small bias is induced by the confounding effects, the bias accumulates for each phenotype and may cause serious problems in multiple-phenotype analysis. By incorporating linear mixed model in the statistics of multivariate regression, I show we can increase the accuracy of multiple phenotype analysis dramatically in high- dimensional data. Lastly, I introduce an efficient multiple testing correction method in linear mixed model. The significance threshold differs as a function of species, marker densities, genetic relatedness, and trait heritability. However, none of the previous multiple testing correction methods can comprehensively account for these factors. I show that the significant threshold changes with the dosage of genetic relatedness and introduce a novel multiple testing correction approach that utilizes linear mixed model to account for the confounding effects in the data.