Open Access Publications from the University of California

## Projection algorithms for large scale optimization and genomic data analysis

• Author(s): Keys, Kevin Lawrence
• et al.
Abstract

The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton’s method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have intensely developed first-order methods that rely only on gradients and subgradients of a cost function.

In this dissertation we focus on projected gradient methods for large-scale con-

strained optimization. We develop a particular case of a proximal gradient method

called the proximal distance algorithm. Proximal distance algorithms combine the

classical penalty method of constrained minimization with distance majorization. To

optimize the loss function $f(x)$ over a constraint set $C$, the proximal distance principle mandates minimizing the penalized loss $f(x) + \rho \mathrm{dist} \; (x,C)^2$ and following the solution $x_{\rho}$ to its limit as $\rho \to \infty$. At each iteration the squared Euclidean distance $\mathrm{dist} \; (x, C)^2$ is majorized by $\| x − \Pi_{C}(x_k) \|_2^2$, where $\Pi_{C}(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x) + \rho \| x − \Pi_{C} (x_k) \|_2^2$ is given by the proximal map $\mathrm{prox}_{ρ^{−1}} \; f [ \Pi_{C} (x_k )]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.

For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding for a particular large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies present unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.