# Your search: "author:"Xu, Shizhong""

## Scholarly Works (17 results)

Genomic selection is a marker-assisted methodology that dramatically decreases the cost of measuring phenotypes by using whole-genome information to predict and select desirable individuals. In plant breeding, it plays an important role in speeding up breeding cycles. Modern techniques make it feasible to obtain marker information from the entire genome. However, this results in high-dimensional predictors when a mathematical model is used to estimate the parameters and predict future crosses. Many statistical models, including variable selection models, can address this problem and have been applied in genomic selection. Variable selection models can also be applied in genome-wide association studies (GWAS), a powerful tool for discovering associations between genetic variation and variation in quantitative traits.

A novel statistical approach based on best linear unbiased prediction (BLUP) is proposed for use in both genomic selection and GWAS. The general idea is to use an algorithm that divides markers into a small-effect group and a large-effect group. Markers in the large-effect group are potentially significant markers associated with the analyzed phenotypic trait. In Chapter 3, we used simulated data and two real-world data sets to demonstrate the distinctions among six statistical methods for genomic selection. In addition, the proposed model was applied to GWAS on another simulated data set, where it was superior to the other two variable selection models.

Quantitative trait loci (QTL) mapping is one of the applications of statistics in genetics. This dissertation focuses on two problems in QTL mapping: a new permutation method for finding the thresholds for shrinkage Bayesian estimation of quantitative trait loci parameters, and three algorithms for handling missing genotypes in multiple QTL mapping under the generalized linear mixed model framework. In addition, this dissertation includes a review of Bayesian statistics and some data analyses using Markov chain Monte Carlo (MCMC).

Chapter 2 is a review of Bayesian statistics and some data analyses using MCMC. It covers almost all aspects of Bayesian statistics, such as Bayes' theorem, prior and posterior distributions, Bayesian inference, and MCMC algorithms.

In Chapter 3, a new way to conduct the permutation test under the shrinkage Bayesian method is developed. The permutation test is the most frequently used method of significance testing in QTL mapping, and it has been applied to QTL mapping based on the Bayesian approach. Because using the traditional permutation test to obtain thresholds for QTL mapping from the MCMC algorithms in Bayesian models is quite time-consuming, Chapter 3 presents a new way to permute the samples from the MCMC algorithms. An empirical power analysis based on simulations is used to test the method.
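For readers unfamiliar with permutation thresholds, the following is a minimal sketch of the traditional permutation test for a genome scan, not the MCMC-based variant developed in the dissertation. It assumes a squared-correlation test statistic and toy simulated genotypes; the function names and all numeric settings are illustrative assumptions.

```python
import random


def correlation_sq(x, y):
    """Squared Pearson correlation between a marker and the phenotype."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def max_statistic(genotypes, phenotype):
    """Largest test statistic across all markers in one genome scan."""
    return max(correlation_sq(marker, phenotype) for marker in genotypes)


def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=1):
    """Approximate (1 - alpha) quantile of the genome-wide maximum under the null."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any marker-phenotype association
        null_max.append(max_statistic(genotypes, shuffled))
    null_max.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_max[index]


# Toy data (assumed): 20 markers coded -1/0/1, 100 individuals, one true QTL at marker 3.
rng = random.Random(0)
n_individuals, n_markers = 100, 20
genotypes = [[rng.choice([-1, 0, 1]) for _ in range(n_individuals)]
             for _ in range(n_markers)]
phenotype = [1.0 * genotypes[3][i] + rng.gauss(0, 1) for i in range(n_individuals)]

threshold = permutation_threshold(genotypes, phenotype)
observed = [correlation_sq(marker, phenotype) for marker in genotypes]
detected = [j for j, stat in enumerate(observed) if stat > threshold]
```

Each permutation repeats the whole scan, which is why the cost grows quickly when every scan is itself an MCMC run.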

The generalized linear mixed model has been applied to the analysis of discrete traits. Chapter 4 presents research on handling missing genotypes in multiple QTL mapping under the generalized linear mixed model framework. Three algorithms are proposed: (1) an expectation algorithm, (2) an overdispersion model algorithm and (3) a mixture model algorithm.

Advances in DNA sequencing technologies allow us to genotype most genetic variants and investigate their effects on phenotypes. Although many genes controlling Mendelian disorders have been successfully identified in the past two decades, the genetic mechanisms underlying complex traits controlled by many genes with small effects are still not well understood. It is therefore desirable to develop more powerful statistical methods that can integrate information from quantitative traits, gene expression and high-density genetic markers, and precisely identify the genetic variants for complex traits.

In Chapter 2, we developed a stochastic expectation-maximization (EM) algorithm for mixture model-based cluster analysis, which provides a general framework for the integrated study of genetic variants, gene expression and phenotype. The strength of association is modeled using a Gaussian mixture with two components. The sampling step in the stochastic EM algorithm improves the convergence of parameters when initial values are poor. The same mixture model and stochastic EM algorithm can be used to identify expression QTL and to study the association between gene expression and quantitative traits.
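As a rough illustration of the idea, the sketch below runs a generic stochastic EM on a univariate two-component Gaussian mixture. It is not the dissertation's model: the data, starting values, iteration count and function names are all assumptions chosen for the example. The S-step draws a hard component label for each observation from its posterior probability, and the M-step then uses ordinary sample means and standard deviations.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def stochastic_em(data, n_iter=100, seed=3):
    """Stochastic EM for a two-component Gaussian mixture (illustrative sketch)."""
    rng = random.Random(seed)
    # Deliberately crude starting values: the extremes of the data.
    means = [min(data), max(data)]
    overall_sd = statistics.pstdev(data)
    sds = [overall_sd, overall_sd]
    weight = 0.5  # mixing proportion of the second component
    for _ in range(n_iter):
        groups = ([], [])
        for x in data:
            # E-step: posterior probability that x belongs to component 1.
            d0 = (1.0 - weight) * normal_pdf(x, means[0], sds[0])
            d1 = weight * normal_pdf(x, means[1], sds[1])
            total = d0 + d1
            post1 = d1 / total if total > 0 else 0.5
            # S-step: sample a hard label from that posterior.
            groups[1 if rng.random() < post1 else 0].append(x)
        # Skip the update if a component is nearly empty in this draw.
        if len(groups[0]) < 2 or len(groups[1]) < 2:
            continue
        # M-step: complete-data estimates from the sampled labels.
        for k in (0, 1):
            means[k] = sum(groups[k]) / len(groups[k])
            sds[k] = max(statistics.pstdev(groups[k]), 1e-6)
        weight = len(groups[1]) / len(data)
    return means, sds, weight


# Toy data (assumed): 300 null-like values around 0 and 100 signal-like values around 4.
rng = random.Random(11)
data = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(4.0, 1.0) for _ in range(100)]
means, sds, weight = stochastic_em(data)
```

Because the labels are sampled rather than averaged, the estimates fluctuate from iteration to iteration; averaging the parameters over the last iterations is a common way to stabilize them.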

In Chapter 3, we proposed a generalized linear mixed model for mapping segregation distortion loci, which can affect the viability of individuals in a population. This dissertation presents a method in which the segregation distortion analysis is formulated as a quantitative genetics problem using a hypothetical liability. The generalized linear mixed model contains the genetic variants across the whole genome and estimates genetic effects using a Bayesian approach, which requires only a likelihood function, a linear predictor and a prior distribution. The mixed model approach is able to handle high-dimensional genomic data.

In Chapter 4, an adaptive ridge regression method is used to estimate the collective effects of rare variants within the same functional group for continuous traits. The adaptive ridge regression model does not assume the directions of the effects. The shared variance for one group is used as a score for testing the overall effects of rare variants. Genetic variants in the same group are selectively weighted to prevent the shared variance from being diluted by non-functional variants. The adaptive ridge regression method can be easily extended to handle multiple groups of rare variants.

Genome-wide association studies (GWAS) are statistical tools widely used to identify associations between genetic variants and a quantitative trait. Through GWAS, the genetic architectures of many complex traits in plants, animals and humans have been revealed. A commonly used method in GWAS is the linear mixed model (LMM). This model is called the fixed model (FM) approach when the marker effect is treated as a fixed effect. In contrast to the FM approach, the scanned marker can also be treated as a random effect, and such a method is called the random model (RM) approach. The RM approach allows the use of the effective number of tests to perform Bonferroni correction and thus significantly increases the statistical power. However, the RM approach requires estimation of two genetic variance components (the variance of the scanned marker and the polygenic variance) and thus involves high computational cost. The main focus of this dissertation is the development of a new method named the randomized fixed model (RFM) methodology. With this method, we can perform the RM GWAS using the results of the FM analysis without additional computation.
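The effect of replacing the number of markers with an effective number of tests can be seen with simple arithmetic. The sketch below only illustrates the Bonferroni rule itself; the marker count and the effective number of tests are hypothetical values, and the abstract does not say how the effective number is obtained.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def neg_log10(p_value):
    """Convert a p-value to the -log10 scale used in Manhattan plots."""
    return -math.log10(p_value)


alpha = 0.05
n_markers = 1_000_000      # hypothetical number of scanned markers
effective_tests = 50_000   # hypothetical effective number of independent tests

strict = bonferroni_threshold(alpha, n_markers)         # about 5e-08
relaxed = bonferroni_threshold(alpha, effective_tests)  # about 1e-06

print(f"threshold with all markers:     {strict:.2e} (-log10 = {neg_log10(strict):.2f})")
print(f"threshold with effective tests: {relaxed:.2e} (-log10 = {neg_log10(relaxed):.2f})")
```

A smaller divisor gives a less stringent threshold, so a marker with a given p-value is more likely to be declared significant, which is the source of the power gain described above.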

There are three chapters in this dissertation. The first chapter introduces the main concepts in GWAS, LMM and corrections for multiple hypothesis testing. The second chapter describes the RFM methodology and demonstrates, in both simulated data and real human data, that the RFM is as powerful as the RM, with reduced computational complexity. In the third chapter, an outlier detection approach using a mixture model for significance testing is described. Compared to the Bonferroni correction, this approach boosts the statistical power while keeping the genome-wide type I error rate below 0.05. Thus, the outlier detection approach can be an alternative to the Bonferroni correction.

State-of-the-art GWAS under the linear mixed model framework, although vastly improved, still suffers from high computational cost and type I error rate. Approaches such as EMMA (Kang et al. 2008), GEMMA (Zhou and Stephens, 2012) and EMMAX (Kang et al. 2010), among others, are better than the traditional GWAS approach, but they are still computationally slow. The purpose of this dissertation is to illustrate our new approach, called the RFM, which can be applied to linear mixed model GWAS. We will show that the RFM is a more efficient approach that saves tremendous computational time while also lowering the type I error rate.

Chapter one will introduce GWAS and briefly discuss the generalized linear mixed model theory that is typically used in GWAS. Chapter two will detail the linear mixed model theory and methodology and its application to GWAS. We will review the different techniques that can be used to estimate the unknown parameters in linear mixed models, and discuss in detail the mathematics behind the techniques that are directly used in our research. Chapter three will illustrate the application of the RFM method to GWAS. We will apply the RFM method to the simulated datasets used in chapter two and compare the results. We will also use the RFM methodology to conduct GWAS on two actual datasets, both containing whole genome sequences. We will compare the GWAS results obtained using the RFM method with the GWAS results obtained using SAS's proc mixed and R.

Quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS) are still the necessary first steps towards gene discovery. With the ever-growing number of genetic markers, more efficient algorithms for genetic mapping are necessary, especially in the big data era when QTL mapping and GWAS are to be conducted simultaneously for thousands of traits, e.g., metabolomic traits. Furthermore, the conventional genomic scanning approaches that detect one locus at a time are subject to many problems, including large matrix inversion, over-conservative tests after Bonferroni correction and difficulty in evaluating the total genetic contribution to a trait’s variance. Targeting these problems, we take a further step and investigate the multiple locus model that detects all markers simultaneously in a single model.

Ordinary ridge regression (ORR) is well known for its high computational efficiency and its ability to analyze data with multicollinearity. However, ORR has never been widely applied to QTL mapping and GWAS because of its severe shrinkage of the estimated effects. Here we introduce a degree of freedom for each parameter and use it to deshrink both the estimated effect and its estimation error, so that the Wald test is brought back to the same level as the Wald test of typical GWAS methods, such as efficient mixed model association (EMMA). The new method is called deshrinking ridge regression (DRR). Using sample data of small, medium and large model sizes, we demonstrate that DRR is efficient for all three model sizes, while EMMA only works for medium and large models. We also developed a sparse Bayesian learning (SBL) method for QTL mapping and GWAS. This new method adopts a coordinate descent algorithm that estimates parameters by updating one parameter at a time, conditional on the current values of all other parameters. It uses an L2 type of penalty that allows the method to handle extremely large sample sizes (>100,000). Simulation studies show that SBL often has higher statistical power and that the simulated true loci are often detected with extremely small p-values, indicating that SBL is insensitive to stringent thresholds in significance testing.
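The shrinkage that motivates DRR, and the one-parameter-at-a-time updating used by coordinate descent, can both be seen in a small example. The sketch below fits ordinary ridge regression by coordinate descent; it is neither DRR nor SBL, and the toy design matrix, the penalty value and the function name are assumptions made for illustration.

```python
def ridge_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize sum((y - X b)^2) + lam * sum(b^2), updating one coefficient at a time."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X beta, with beta = 0 initially
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the residual that excludes its own contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_ss[j] * beta[j]
            new_beta = rho / (col_ss[j] + lam)  # closed-form update given all other effects
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Toy design (assumed) with orthogonal columns, each with sum of squares 3.
X = [[1, 0], [0, 1], [1, 1], [1, -1]]
true_beta = [2.0, -1.0]
y = [sum(x_ij * b for x_ij, b in zip(row, true_beta)) for row in X]

ols_like = ridge_coordinate_descent(X, y, lam=0.0)  # no penalty: recovers the true effects
shrunken = ridge_coordinate_descent(X, y, lam=3.0)  # penalty equal to the column sum of squares
```

With orthogonal columns, a penalty equal to the column sum of squares halves every estimate, which shows how strongly a single shared penalty can pull large effects toward zero.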

The genome-wide association study (GWAS) has become a powerful tool for revealing the genetic architecture of complex traits in plant studies, animal research and human disease. This method involves scanning genotypes from many different samples to study the associations between genetic markers and phenotypes. With the availability of low-cost, high-throughput technology, large-scale data are available for analysis, and efficient algorithms are needed to scan up to millions of markers. Large-scale genomic studies also involve high-dimensional statistics, which brings many practical difficulties in modeling and computation. This dissertation addresses two problems in GWAS, namely the computational efficiency of marker scanning and the correction of the Beavis effect. The connection between the fixed effect and the random effect in a linear mixed model is the inspiration for the current work. The methods we propose are supported both theoretically and empirically.

In the first half of this dissertation, we investigate the significance test of markers in GWAS and propose a method for constructing a de-shrink ridge estimator. This enables us to scan all the markers simultaneously in one model. The de-shrink estimators and test statistic are fast to compute, and they perform at a level comparable to conventional GWAS approaches, such as efficient mixed model association (EMMA). We also prove that, given sufficient information, the de-shrink estimators are asymptotically equivalent to the fixed effect estimators in EMMA.

The second half of this dissertation focuses on correcting the bias caused by the Beavis effect in GWAS. The Beavis effect refers to the phenomenon that the average effect size of detected loci is inflated because only loci that pass a statistical test are reported. There is increasing interest in applying the linear mixed model in GWAS, where the scanned marker is typically treated as a fixed effect; this is called the fixed model (FM) approach. Another way to tackle the same problem is to treat the marker effect as random, which is called the random model (RM) approach. However, the random term results in an extra computational burden. We develop a novel random fixed approach (RFM) to relieve the computational difficulties. Taking advantage of RFM and the censoring involved in significance testing, we propose an efficient way to correct the Beavis effect. We demonstrate the method on a simulated dataset and in real data applications.
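A short simulation makes the Beavis effect concrete. The sketch below is only a demonstration of the selection bias, not the correction proposed in the dissertation; the true effect, standard error, significance threshold and function name are assumed values.

```python
import random
import statistics


def simulate_beavis(true_effect=0.2, std_error=0.1, n_reps=5000,
                    z_threshold=3.0, seed=7):
    """Compare the mean estimate over all replicates with the mean over detected ones."""
    rng = random.Random(seed)
    # Each replicate gives an unbiased but noisy estimate of the same true effect.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_reps)]
    # Only estimates whose z-score passes the threshold count as "detected".
    detected = [b for b in estimates if abs(b) / std_error > z_threshold]
    mean_all = statistics.mean(estimates)
    mean_detected = statistics.mean(detected)
    power = len(detected) / n_reps
    return mean_all, mean_detected, power


mean_all, mean_detected, power = simulate_beavis()
print(f"mean of all estimates:      {mean_all:.3f}")
print(f"mean of detected estimates: {mean_detected:.3f}")
print(f"empirical power:            {power:.3f}")
```

The estimator is unbiased across all replicates, but conditioning on significance keeps only the upper tail, so the average reported effect exceeds the true one, and the inflation is largest when power is low.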