## Geometry of maximum likelihood estimation in Gaussian graphical models

- Author(s): Uhler, Caroline
- Advisor(s): Sturmfels, Bernd, et al.

## Abstract

Algebraic statistics uses algebraic techniques to develop new paradigms and algorithms for data analysis. The development of computational algebra software provides a powerful tool for analyzing statistical models. In Part I of this thesis, we use methods from computational algebra and algebraic geometry to study Gaussian graphical models. Algebraic methods have proven useful for statistical theory and applications alike. We describe a particular application to computational biology in Part II.

Part I of this thesis investigates geometric aspects of maximum likelihood estimation in Gaussian graphical models. More generally, we study multivariate normal models that are described by linear constraints on the inverse of the covariance matrix. Maximum likelihood estimation for such models leads to the problem of maximizing the determinant function over a spectrahedron, and to the problem of characterizing the image of the positive definite cone under an arbitrary linear projection. In Chapter 2, we examine these problems at the interface of statistics and optimization from the perspective of convex algebraic geometry and characterize the cone of all sufficient statistics for which the maximum likelihood estimator (MLE) exists. In Chapter 3, we develop an algebraic elimination criterion, which allows us to find exact lower bounds on the number of observations needed to ensure that the MLE exists with probability one. We apply this criterion to bipartite graphs, grids, and colored graphs. We also present the first instance of a graph for which the MLE exists with probability one even when the number of observations equals the treewidth. Computational algebra software can be used to study graphs with a limited number of vertices and edges. In Chapter 4, we study the problem of existence of the MLE from an asymptotic point of view by fixing a class of graphs and letting the number of vertices grow to infinity. We prove that for very large cycles, two observations already suffice for the MLE to exist with probability one.
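To make the estimation problem concrete, the following is a minimal numerical sketch, not the algebraic approach of the thesis. It assumes the standard formulation in which the MLE of the inverse covariance matrix K maximizes log det K − tr(SK) over positive definite K with zeros at the non-edges of the graph, and it uses iterative proportional scaling, a classical algorithm for this problem. The function name `ips_mle`, the simulated data, and the choice of the 4-cycle are illustrative assumptions.

```python
import numpy as np

def ips_mle(S, edges, sweeps=1000):
    """Iterative proportional scaling over the edges (cliques) of a graph.

    S     : sample covariance matrix (assumed positive definite here)
    edges : list of vertex pairs; every vertex is assumed to lie on an edge
    Returns an estimate of the inverse covariance matrix K.
    """
    p = S.shape[0]
    K = np.eye(p)  # positive definite start with zeros at all non-edges
    for _ in range(sweeps):
        for i, j in edges:
            idx = np.ix_([i, j], [i, j])
            Sigma = np.linalg.inv(K)
            # Adjust the 2x2 block so that inv(K) matches S on this edge.
            K[idx] += np.linalg.inv(S[idx]) - np.linalg.inv(Sigma[idx])
    return K

# Hypothetical data: 50 observations of a 4-dimensional Gaussian vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
S = X.T @ X / 50

# The 4-cycle 0 - 1 - 2 - 3 - 0; the non-edges are (0, 2) and (1, 3).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
K_hat = ips_mle(S, edges)
Sigma_hat = np.linalg.inv(K_hat)
```

At convergence, `Sigma_hat` should agree with `S` on the diagonal and on the edges, while `K_hat` is zero at the non-edges. With 50 observations the sample covariance is positive definite, so existence of the MLE is not in question; the regime studied in Part I is the one with few observations, where `S` is singular and this sketch does not apply as written.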

Part II of this thesis describes an application of algebraic statistics to association studies. Rapid progress in genotyping techniques has enabled large genome-wide association studies. Existing methods often focus on determining associations between single loci and a specific phenotype. However, a particular phenotype is usually the result of complex relationships between multiple loci and the environment. We develop a method for finding interacting genes (i.e., epistasis) using Markov bases. We test our method on simulated data and compare it to a two-stage logistic regression method and to a fully Bayesian method, showing that we are able to detect the interacting loci when other methods fail to do so. Finally, we apply our method to a genome-wide dog data set and identify epistasis associated with canine hair length.