Projection algorithms for large scale optimization and genomic data analysis
 Keys, Kevin Lawrence
 Advisor(s): Lange, Kenneth L
Abstract
The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton's method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have actively developed first-order methods that rely only on gradients and subgradients of a cost function.
In this dissertation we focus on projected gradient methods for large-scale constrained optimization. We develop a particular case of a proximal gradient method called the proximal distance algorithm. Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. To optimize the loss function $f(x)$ over a constraint set $C$, the proximal distance principle mandates minimizing the penalized loss $f(x) + \frac{\rho}{2} \mathrm{dist}(x, C)^2$ and following the solution $x_{\rho}$ to its limit as $\rho \to \infty$. At each iteration the squared Euclidean distance $\mathrm{dist}(x, C)^2$ is majorized by $\| x - \Pi_{C}(x_k) \|_2^2$, where $\Pi_{C}(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x) + \frac{\rho}{2} \| x - \Pi_{C}(x_k) \|_2^2$ is given by the proximal map $\mathrm{prox}_{\rho^{-1} f}[\Pi_{C}(x_k)]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.
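To make the iteration concrete, the following is a minimal sketch, not the dissertation's implementation. It assumes a least-squares loss $f(x) = \frac{1}{2} \| Ax - b \|_2^2$ and takes the surrogate as $f(x) + \frac{\rho}{2} \| x - \Pi_{C}(x_k) \|_2^2$, so that the proximal map has the closed form $(A^T A + \rho I)^{-1} (A^T b + \rho \Pi_{C}(x_k))$. The function names, the nonnegativity constraint, and the geometric schedule for $\rho$ are all illustrative assumptions.

```python
import numpy as np


def project_nonneg(x):
    """Projection onto the nonnegative orthant (an illustrative choice of C)."""
    return np.maximum(x, 0.0)


def proximal_distance_ls(A, b, project, rho=1.0, rho_inc=1.5, rho_max=1e6,
                         inner_iters=50, tol=1e-8):
    """Sketch of a proximal distance iteration for f(x) = 0.5 * ||Ax - b||^2.

    For each fixed rho, repeat x <- prox_{f / rho}(project(x)), then increase
    rho. The schedule (rho_inc, rho_max) is a hypothetical choice made for
    this example.
    """
    n = A.shape[1]
    AtA = A.T @ A
    Atb = A.T @ b
    x = np.zeros(n)
    while rho <= rho_max:
        M = AtA + rho * np.eye(n)  # system matrix of the prox for this rho
        for _ in range(inner_iters):
            p = project(x)  # Pi_C(x_k)
            # Minimizer of 0.5*||Ax - b||^2 + (rho/2)*||x - p||^2
            x_new = np.linalg.solve(M, Atb + rho * p)
            converged = np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x))
            x = x_new
            if converged:
                break
        rho *= rho_inc  # follow x_rho as rho grows
    return project(x)  # return a feasible point


if __name__ == "__main__":
    A = np.eye(3)
    b = np.array([1.0, -2.0, 3.0])
    print(proximal_distance_ls(A, b, project_nonneg))
```

Swapping `project_nonneg` for another projection changes the constraint set without touching the rest of the loop, which is the practical appeal of separating the projection from the proximal map.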
For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding (IHT) for a particular large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies presents unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.
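As a rough illustration of iterative hard thresholding for sparse linear regression, the sketch below alternates a gradient step on the least-squares loss with a projection onto the set of vectors having at most `k` nonzero entries. It is a dense, single-threaded toy version: it includes none of the genotype compression, multicore, or GPU features described above, and the fixed step size $1 / \| X \|_2^2$, the function names, and the stopping rule are assumptions made for this example.

```python
import numpy as np


def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argpartition(np.abs(beta), -k)[-k:]
        out[keep] = beta[keep]
    return out


def iht(X, y, k, max_iter=500, tol=1e-8):
    """Sketch of IHT for min 0.5 * ||y - X beta||^2 subject to ||beta||_0 <= k.

    Uses a fixed step 1 / ||X||_2^2 (spectral norm); this is an assumed,
    conservative choice rather than the dissertation's step-size rule.
    """
    p = X.shape[1]
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = X.T @ (X @ beta - y)  # gradient of the least-squares loss
        beta_new = hard_threshold(beta - step * grad, k)  # project onto sparsity set
        converged = np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta))
        beta = beta_new
        if converged:
            break
    return beta


if __name__ == "__main__":
    X = np.eye(4)
    y = np.array([0.1, 5.0, -3.0, 0.2])
    print(iht(X, y, k=2))
```

With an identity design matrix the method reduces to a single hard-thresholding of `y`, which makes the role of the projection easy to see; on real genotype data the gradient step dominates the cost.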