## Finding Genes Related to Disease Using Statistical Learning

- Author(s): Goldstein, Benjamin Alan
- Advisor(s): Hubbard, Alan E., et al.

## Abstract

This dissertation consists of analyses of three separate genetic association datasets. Each has a unique data structure and a different question of interest, and therefore requires distinct approaches and methodologies. As such, the three substantive chapters (2-4) can each stand on their own. However, the overarching question in each of these studies is the same: which genes (or genetic material) are related to the disease or outcome being studied? Moreover, while the methodologies are distinct, they all incorporate statistical learning to obtain some measure of inference.

Study 1 - As computational power has improved, the application of statistical learning algorithms to finding single nucleotide polymorphisms (SNPs) related to disease has become more common. The hope is that these algorithms will be more capable than typical marginal testing of detecting SNPs with higher-order effects. The Random Forests (RF) algorithm is one such algorithm that has seen increased use with genetic data. As part of its output, RF ranks the predictor variables (SNPs) by their relative importance.

The present study represents the first application of the RF algorithm to Genome Wide Association (GWA) data and investigates how best to use the algorithm for this unique data structure. A multiple sclerosis (MS) GWA data set is used for the analysis. Results indicate that the typical tuning parameter settings need to be adjusted for the high degree of sparsity in the data. Furthermore, the most meaningful results were obtained when both unimportant and overly important SNPs were removed. RF was able to replicate some previous findings from the same data. Moreover, four genes not previously associated with MS were identified.

Study 2 - In many analyses, one has data at one level but wishes to draw inference at another. For example, in genetic association studies, one observes units of DNA referred to as SNPs, but wants to determine whether genes, which are composed of SNPs, are associated with disease. While there are some available approaches for addressing this issue, they usually involve making parametric assumptions and are not easily generalizable. A statistical test is proposed for testing the association of a set of variables with an outcome of interest. No assumptions are made about the functional form relating the variables to the outcome. A general function is fit using any statistical learning algorithm, with the SuperLearner algorithm suggested. The parameter of interest is the cross-validated risk, which is compared to an expected risk. A Wald test is proposed that uses the influence curve of the cross-validated risk to obtain the variance. It is shown both theoretically and via simulation that the test maintains appropriate type I error control and is more powerful than parametric tests under more general alternatives. The test is applied to an MS candidate gene study. Three separate analyses are performed, highlighting the flexibility of the approach.
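As a rough illustration of the Study 2 test, the following is a minimal sketch, not the dissertation's implementation. It rests on several assumptions that are not in the abstract: squared-error loss, a single covariate, a least-squares line standing in for the SuperLearner, and an "expected risk" taken to be the cross-validated risk of a predictor that ignores the covariate. The variance comes from the empirical spread of the per-observation held-out loss differences, used here as an approximation to the influence curve. The names `fit_simple_linear` and `cv_risk_wald_test` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def fit_simple_linear(xs, ys):
    """Least-squares line; an assumed stand-in for any learner."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cv_risk_wald_test(xs, ys, n_folds=5, seed=0):
    """One-sided Wald test that the fitted learner lowers the CV risk."""
    n = len(ys)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    loss_diff = [0.0] * n  # null loss minus fitted loss, per observation

    for k in range(n_folds):
        held_out = order[k::n_folds]
        held_out_set = set(held_out)
        train = [i for i in order if i not in held_out_set]

        predict = fit_simple_linear([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Assumed null predictor: the training-fold mean, ignoring x.
        null_pred = mean(ys[i] for i in train)

        for i in held_out:
            loss_null = (ys[i] - null_pred) ** 2
            loss_fit = (ys[i] - predict(xs[i])) ** 2
            loss_diff[i] = loss_null - loss_fit

    estimate = mean(loss_diff)                     # reduction in CV risk
    std_err = stdev(loss_diff) / math.sqrt(n)      # influence-curve-style SE
    z = estimate / std_err
    p_value = 1.0 - NormalDist().cdf(z)            # large z favors association
    return estimate, z, p_value
```

Calling `cv_risk_wald_test(xs, ys)` on paired lists of covariate and outcome values returns the estimated risk reduction, the Wald statistic, and a one-sided p-value. Extending the sketch to a set of SNPs would mean swapping in a learner that accepts several covariates; the test statistic itself would be unchanged under these assumptions.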

Study 3 - Secondary analyses, such as Gene Ontology and Motif analysis, have become central components of gene expression experiments, allowing researchers to derive biological understanding from the set of genes that are differentially expressed. An important statistical task is determining which genes should be passed on to such programs and how the genes should be grouped for analysis. The typical approach is to cluster the set of differentially expressed genes and pass these clusters on to the secondary analyses. However, many expression experiments have specific hypotheses, which allow one to analyze and group the genes in a more targeted way. To illustrate the utility of being more specific, a gene expression study of C. elegans is used in which a particular outcome was observed and an explanation for it was sought. A general model is fit and analyzed to estimate the parameters corresponding to the specific hypothesis, leading to four natural groupings of the differentially expressed genes. These groupings lead to meaningful results in the secondary analyses, allowing the biologist to make robust hypotheses that are experimentally confirmed. It is shown that a traditional approach would not have yielded such robust findings.