Design of efficient and statistically powerful approaches for human genetics
- Author(s): Sul, Jae Hoon
- Advisor(s): Eskin, Eleazar
- et al.
The advent of genotyping and sequencing technologies has enabled human genetics to discover numerous genetic variants associated with many diseases and traits over the past decades. One of the most effective approaches to detecting those variants has been the genome-wide association study (GWAS), which scans variants across the genome. A GWAS collects people with a disease (called "cases") and people without the disease (called "controls") and compares allele frequencies between the two groups to identify genetic variants associated with the disease. This simple yet effective approach has been widely used, and more than 1,600 GWASs have been published during the last decade.
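As a rough illustration of the case/control comparison described above, the following minimal sketch runs a Pearson chi-square test on a 2x2 table of allele counts for a single variant. The function name `allelic_chi_square` and the example counts are hypothetical, and the sketch assumes every row and column total is non-zero; it is a textbook allelic test, not the specific procedure used in this dissertation.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Hypothetical helper for illustration; assumes all margins are non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele frequency were equal in both groups.
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up counts: 200 case alleles and 200 control alleles.
stat, p_value = allelic_chi_square(60, 140, 40, 160)
```

A genome-wide scan repeats a test of this kind at every variant, which is why the significance threshold must account for the very large number of tests.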
An underlying assumption of GWAS is that cases and controls are sampled from the same population. If they are not, then a phenomenon called "population structure" may cause spurious associations. Correcting for population structure in GWASs has been an important problem in human genetics, and several methods have been proposed. However, those methods either fail to correct for complex structure or are too computationally demanding for current GWAS datasets. I will introduce a new statistical approach that correctly removes the effects of population structure and reduces the computational time from years to hours.
Recently, sequencing technologies that enable the detection of rare variants have received considerable attention and have been utilized by many GWASs. In these studies, rare variants in a gene are often grouped together to test the aggregated effect of rare variants on disease susceptibility. However, there are many different approaches to combining information from multiple rare variants, and it is unknown which approach is optimal for detecting associations of rare variants. I will introduce two novel approaches to better identify a group of rare variants involved in a disease. I will show using simulations that our approaches outperform previous methods, and using real sequencing data, I will show that our methods can identify an association reported by a previous study.
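To make the idea of grouping rare variants concrete, here is a minimal sketch of a simple collapsing (burden-style) test: each individual is reduced to a carrier indicator for any rare allele in the gene, and carrier proportions are compared between cases and controls with a two-proportion z-test. The function `collapsing_burden_test` and the toy genotypes are hypothetical, the sketch assumes both groups are non-empty and that some but not all individuals are carriers, and it illustrates only a generic baseline, not the two novel approaches introduced in this dissertation.

```python
import math


def collapsing_burden_test(genotypes, is_case):
    """Compare rare-variant carrier rates between cases and controls.

    genotypes: one list per individual of minor-allele counts (0, 1, or 2)
               at each rare variant in a gene.
    is_case:   one bool per individual (True for cases).
    Hypothetical helper for illustration only.
    """
    # Collapse all rare variants in the gene into one carrier indicator.
    carriers = [any(count > 0 for count in row) for row in genotypes]

    n_case = sum(1 for flag in is_case if flag)
    n_ctrl = len(is_case) - n_case
    case_carriers = sum(1 for c, flag in zip(carriers, is_case) if c and flag)
    ctrl_carriers = sum(1 for c, flag in zip(carriers, is_case) if c and not flag)

    p_case = case_carriers / n_case
    p_ctrl = ctrl_carriers / n_ctrl
    pooled = (case_carriers + ctrl_carriers) / (n_case + n_ctrl)

    # Two-proportion z-test with a pooled variance estimate.
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_case + 1.0 / n_ctrl))
    z = (p_case - p_ctrl) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return z, p_value


# Toy data: 4 cases followed by 4 controls, 3 rare variants in the gene.
genotypes = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0],  # cases
    [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1],  # controls
]
is_case = [True] * 4 + [False] * 4
z, p_value = collapsing_burden_test(genotypes, is_case)
```

Collapsing treats every rare variant as equally informative and as acting in the same direction, which is one reason the choice of how to combine variants matters.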
Finally, I will introduce a statistical approach to identify expression quantitative trait loci (eQTLs), genetic variants that are associated with gene expression, in multiple tissues. Recent technological developments and cost decreases have enabled eQTL studies to collect expression data in multiple tissues, but most studies focus on finding eQTLs in each tissue separately. I will introduce a statistical approach that combines results from multiple tissues to better identify eQTLs. I will show, using simulations and multi-tissue data from mouse, that our approach detects many eQTLs undetected by traditional eQTL methods.
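One generic way to combine per-tissue results is an inverse-variance-weighted fixed-effects meta-analysis, sketched below. This is offered only as an illustration of why pooling tissues can increase power; it is an assumption made for the example and not necessarily the approach developed in this dissertation. The function `fixed_effects_meta` and the effect sizes are hypothetical, and the sketch assumes the per-tissue estimates are independent, which may not hold when tissues come from the same individuals.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effects combination of estimates.

    betas:           per-tissue effect-size estimates for one variant-gene pair.
    standard_errors: their standard errors (all assumed positive).
    Hypothetical helper; assumes independent estimates across tissues.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    combined_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)

    z = combined_beta / combined_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return combined_beta, combined_se, z, p_value


# Made-up estimates from two tissues with equal precision.
beta, se, z, p_value = fixed_effects_meta([0.2, 0.4], [0.1, 0.1])
```

In this toy example the combined standard error is smaller than either per-tissue standard error, so an effect that is only modest in each tissue separately yields a larger combined test statistic.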