Statistical methods for molecular quantitative trait locus analysis
- Zhou, Heather J.
- Advisor(s): Li, Jingyi
Abstract
Molecular quantitative trait locus (molecular QTL, henceforth "QTL") analysis investigates the relationship between genetic variants and molecular traits, helping explain findings in genome-wide association studies. This dissertation addresses two major problems in QTL analysis: the hidden variable inference problem and the eGene identification problem.
Estimating and accounting for hidden variables is widely practiced as an important step in QTL analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose. In my first project, I benchmark popular hidden variable inference methods including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP) against principal component analysis (PCA)—a well-established dimension reduction and factor discovery method—via 362 synthetic and 110 real data sets. I show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better performing, and much easier to interpret and use. To help researchers use PCA in their QTL analysis, I provide an R package PCAForQTL along with a detailed guide, both of which are available at https://github.com/heatherjzhou/PCAForQTL. I believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research.
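As a rough illustration of the idea of using PCA for hidden variable inference, the sketch below extracts the top principal components of a samples-by-genes expression matrix so that they could be supplied as covariates in QTL mapping. It is a minimal sketch under stated assumptions, not the PCAForQTL implementation: the function names (`center_and_scale`, `top_pcs`, `pearson`), the power-iteration approach, and the simulated data are all hypothetical choices made to keep the example dependency-free.

```python
import math
import random


def center_and_scale(matrix):
    """Center and scale each column (gene) of a samples-by-genes matrix."""
    n, p = len(matrix), len(matrix[0])
    scaled = [[0.0] * p for _ in range(n)]
    for j in range(p):
        column = [row[j] for row in matrix]
        mean = sum(column) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
        for i in range(n):
            scaled[i][j] = (column[i] - mean) / sd if sd > 0 else 0.0
    return scaled


def top_pcs(matrix, k, n_iter=300, seed=0):
    """Return k PC score vectors (each of length n_samples).

    Uses power iteration with deflation on the n-by-n Gram matrix X X^T.
    Its eigenvectors are the left singular vectors of X, so the PC scores
    are sqrt(eigenvalue) * eigenvector.
    """
    x = center_and_scale(matrix)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    directions, scores = [], []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(g * b for g, b in zip(row, v)) for row in gram]
            # Remove components along previously found directions.
            for u in directions:
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        gv = [sum(g * b for g, b in zip(row, v)) for row in gram]
        eigenvalue = sum(a * b for a, b in zip(v, gv))
        directions.append(v)
        scores.append([math.sqrt(max(eigenvalue, 0.0)) * a for a in v])
    return scores


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


# Simulated data (an assumption for illustration): one hidden variable
# affects every gene, plus independent noise.
rng = random.Random(1)
n_samples, n_genes = 30, 20
hidden = [rng.gauss(0, 1) for _ in range(n_samples)]
loadings = [rng.uniform(0.5, 1.5) for _ in range(n_genes)]
expression = [[loadings[j] * hidden[i] + rng.gauss(0, 0.3)
               for j in range(n_genes)] for i in range(n_samples)]

pcs = top_pcs(expression, k=2)  # candidate covariates for QTL mapping
print("abs. correlation of PC1 with the hidden variable:",
      abs(pearson(pcs[0], hidden)))
```

In practice a linear algebra library would replace the hand-written power iteration, and the number of PCs to keep would be chosen by a principled criterion rather than fixed at 2 as it is here.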
A central task in expression quantitative trait locus (eQTL) analysis is to identify cis-eGenes (henceforth "eGenes"), i.e., genes whose expression levels are regulated by at least one local genetic variant. Among the existing eGene identification methods, FastQTL is considered the gold standard but is computationally expensive as it requires thousands of permutations for each gene. Alternative methods such as eigenMT and TreeQTL have lower power than FastQTL. In my second project, I propose ClipperQTL, which reduces the number of permutations needed from thousands to 20 for data sets with large sample sizes (>450) by using the contrastive strategy developed in Clipper; for data sets with smaller sample sizes, it uses the same permutation-based approach as FastQTL. I show that ClipperQTL performs as well as FastQTL and runs about 500 times faster if the contrastive strategy is used and 50 times faster if the conventional permutation-based approach is used. The R package ClipperQTL is available at https://github.com/heatherjzhou/ClipperQTL. This project demonstrates the potential of the contrastive strategy developed in Clipper and provides a simpler and more efficient way of identifying eGenes.
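To make the cost of permutation-based eGene identification concrete, the sketch below computes a gene-level empirical p-value by permuting the expression vector and recomputing the strongest association across the gene's local variants. It is a deliberately naive scheme under stated assumptions: it is neither FastQTL's actual procedure nor Clipper's contrastive strategy, and the function names (`pearson`, `gene_level_permutation_pvalue`), the test statistic, and the simulated genotypes are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def gene_level_permutation_pvalue(expression, genotypes, n_perm=200, seed=0):
    """Empirical gene-level p-value from direct permutations.

    expression: expression levels of one gene, one value per sample.
    genotypes: list of local variants, each a per-sample genotype list.
    The statistic is the maximum absolute correlation over the variants.
    """
    rng = random.Random(seed)
    observed = max(abs(pearson(expression, g)) for g in genotypes)
    permuted = list(expression)
    n_exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # breaks the genotype-expression link
        null_stat = max(abs(pearson(permuted, g)) for g in genotypes)
        if null_stat >= observed:
            n_exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (n_exceed + 1) / (n_perm + 1)


# Simulated data (an assumption for illustration): 100 samples, 5 local
# variants coded 0/1/2; the first variant affects the "eGene" only.
rng = random.Random(2)
n_samples, n_variants = 100, 5
genotypes = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_samples)]
             for _ in range(n_variants)]
egene_expr = [0.8 * genotypes[0][i] + rng.gauss(0, 1) for i in range(n_samples)]
null_expr = [rng.gauss(0, 1) for _ in range(n_samples)]

p_egene = gene_level_permutation_pvalue(egene_expr, genotypes)
p_null = gene_level_permutation_pvalue(null_expr, genotypes)
print("eGene p-value:", p_egene, "| null gene p-value:", p_null)
```

Because the smallest attainable p-value is 1 / (n_perm + 1), resolving small p-values this way requires many permutations for every gene, which is the computational burden the abstract refers to.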