Application of Multiple Locus Linear Mixed Model in Linkage Analyses and Association Studies
- Author(s): Wang, Meiyue
- Advisor(s): Xu, Shizhong
- et al.
Quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS) remain the necessary first steps towards gene discovery. With the ever-growing number of genetic markers, more efficient algorithms for genetic mapping are needed, especially in the big data era, when QTL mapping and GWAS are to be conducted simultaneously for thousands of traits, e.g., metabolomic traits. Furthermore, conventional genomic scanning approaches that detect one locus at a time are subject to many problems, including large matrix inversion, over-conservative tests after Bonferroni correction, and difficulty in evaluating the total genetic contribution to a trait’s variance. To address these problems, we take a further step and investigate the multiple locus model, which fits all markers simultaneously in a single model.
Ordinary ridge regression (ORR) is well known for its high computational efficiency and its ability to analyze data with multicollinearity. However, ORR has never been widely applied to QTL mapping and GWAS because it shrinks the estimated effects severely. Here we introduce a degree of freedom for each parameter and use it to deshrink both the estimated effect and its estimation error, so that the Wald test is brought back to the same level as the Wald test of typical GWAS methods, such as efficient mixed model association (EMMA). The new method is called deshrinking ridge regression (DRR). Using sample data of small, medium and large model sizes, we demonstrate that DRR is efficient for all three model sizes while EMMA only works for medium and large models. We also developed a sparse Bayesian learning (SBL) method for QTL mapping and GWAS. This new method adopts a coordinate descent algorithm that estimates parameters by updating one parameter at a time, conditional on the current values of all other parameters. It uses an L2 type of penalty that allows the method to handle extremely large sample sizes (>100,000). Simulation studies show that SBL often has higher statistical power and that the simulated true loci are often detected with extremely small p-values, indicating that SBL is insensitive to stringent thresholds in significance testing.
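The two computational ideas mentioned above, a per-parameter degree of freedom used for deshrinking and one-at-a-time coordinate updates under an L2 penalty, can be illustrated with a minimal Python sketch. It is not the DRR or SBL implementation from this work. It assumes a single fixed ridge penalty `lam`, simulated genotypes, and a per-parameter degree of freedom taken as the diagonal of (X'X + lam I)^{-1} X'X, which may differ from the definition used in the thesis. All function and variable names are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate b = (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_coordinate_descent(X, y, lam, n_sweeps=500, tol=1e-12):
    """Ridge by coordinate descent: update one effect at a time,
    holding all other effects at their current values."""
    p = X.shape[1]
    b = np.zeros(p)
    col_ss = np.einsum("ij,ij->j", X, X)   # x_k'x_k for each marker
    resid = y.astype(float).copy()         # y - X b, with b = 0 at the start
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(p):
            # x_k'(y - sum_{j != k} x_j b_j), computed from the running residual
            rho = X[:, k] @ resid + col_ss[k] * b[k]
            new_bk = rho / (col_ss[k] + lam)
            delta = new_bk - b[k]
            resid -= X[:, k] * delta       # keep the residual in sync
            b[k] = new_bk
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return b

def per_parameter_df(X, lam):
    """ASSUMED definition: diagonal of (X'X + lam*I)^{-1} X'X.
    Each entry lies in [0, 1) and the entries sum to the
    effective degrees of freedom of the ridge fit."""
    p = X.shape[1]
    XtX = X.T @ X
    return np.diag(np.linalg.solve(XtX + lam * np.eye(p), XtX))

# Simulated data (illustrative only): 200 individuals, 20 markers coded 0/1/2.
rng = np.random.default_rng(0)
n, p, lam = 200, 20, 5.0
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                        # center markers, no intercept needed
true_effects = np.zeros(p)
true_effects[[2, 11]] = [1.0, -0.8]        # two simulated loci
y = X @ true_effects + rng.normal(size=n)
y -= y.mean()

b_closed = ridge_closed_form(X, y, lam)
b_cd = ridge_coordinate_descent(X, y, lam)
df = per_parameter_df(X, lam)
b_deshrunk = b_closed / df                 # scale each effect back by its own df
```

In this sketch the coordinate descent solution converges to the closed-form ridge estimate without ever forming or inverting a p-by-p matrix, which is what makes one-at-a-time updates attractive when the number of markers is large. Dividing each shrunken effect by its own degree of freedom is one simple way to undo shrinkage; the corresponding correction to the estimation error and the resulting Wald test are not reproduced here.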