UC Berkeley Electronic Theses and Dissertations

Large-scale sparse regression models under weak assumptions

Abstract

Many modern problems in science and other areas involve the extraction of useful information from so-called 'big data.' However, many classical statistical techniques are not equipped to meet the challenges posed by big-data problems. Furthermore, existing statistical methods often result in intractable algorithms. Consequently, the last $15$--$20$ years have seen a flurry of research on adapting existing methods and developing new methods that overcome some of the statistical and computational challenges posed by problems involving big data.

Regression is one of the oldest statistical techniques. In many modern regression problems involving big datasets, the number of predictors or covariates $p$ is large compared to the number of samples $n$, causing significant computational and statistical challenges. To overcome these challenges, many researchers have proposed imposing sparsity on the vector of regression coefficients $\beta \in \mathbb{R}^{p}$. Furthermore, researchers have proposed using $\ell_1$-based convex penalties for estimating $\beta$ under the sparsity assumption, since they yield implementable algorithms with desirable performance guarantees. While there is already an established body of work on procedures for sparse regression models, most existing results rely on very restrictive model assumptions. These assumptions are often not satisfied in many scientific problems. In this thesis, we relax three restrictive model assumptions that are commonly imposed in the literature for estimating sparse regression models.
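For concreteness, a canonical example of such an $\ell_1$-based procedure is the Lasso; with design matrix $X \in \mathbb{R}^{n \times p}$, response $y \in \mathbb{R}^{n}$, and regularization parameter $\lambda_n > 0$ (notation introduced here only for illustration), it solves the convex program
$$\widehat{\beta} \in \arg\min_{\beta \in \mathbb{R}^{p}} \Big\{ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_n \|\beta\|_1 \Big\}.$$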

The three assumptions are: (1) strict sparsity, that is, the vector of regression coefficients $\beta$ contains only a small number of non-zero entries; (2) the covariates or predictors are independent; and (3) the response depends linearly on the covariates. Given that these three model assumptions are often not satisfied in many practical settings, it is important to understand whether existing theoretical results are robust to violations of these assumptions.

In Chapter 2, we impose a weaker notion of sparsity on $\beta$, known as $\ell_q$-ball sparsity, which requires only that the vector of regression coefficients lie in an $\ell_q$ ball; in particular, $\beta$ need not contain any exact zeros. We prove that under this weaker $\ell_q$-ball sparsity assumption, it is possible to develop estimators with desirable mean-squared error behavior, even in the regime $p \gg n$.
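As a point of reference, the $\ell_q$ ball referred to here is typically defined, for a radius $R_q > 0$ and $q \in (0, 1]$ (notation introduced here for illustration), as
$$\mathbb{B}_q(R_q) = \Big\{ \beta \in \mathbb{R}^{p} : \sum_{j=1}^{p} |\beta_j|^q \le R_q \Big\},$$
which interpolates between exact sparsity (the limit $q \to 0$ counts non-zero entries) and an $\ell_1$ constraint at $q = 1$.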

The weakest known condition under which the Lasso achieves the optimal mean-squared error rate is the restricted eigenvalue condition~\cite{vandeGeer07, BicRitTsy08, Negahban09}. Existing results show that when the covariates are independent, the restricted eigenvalue condition is satisfied. However, the setting in which predictors or covariates are correlated is also of interest, and considerably less work has dealt with this case. In Chapter 3, we prove that the restricted eigenvalue condition is satisfied for various correlated Gaussian designs, including time series models, spiked covariance models, and others.
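For orientation, one common formulation of the restricted eigenvalue condition (stated here in a standard form, which may differ slightly from the exact definition used in the thesis) requires that, for some constant $\kappa > 0$ and every subset $S$ of at most $s$ coordinates,
$$\frac{\|X\theta\|_2^2}{n} \ge \kappa \, \|\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{R}^{p} \text{ with } \|\theta_{S^c}\|_1 \le 3\,\|\theta_S\|_1,$$
so that the design matrix $X$ behaves like a well-conditioned operator over the cone of approximately sparse directions.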

Finally, in Chapter 4 we analyze sparse additive models, a non-parametric analogue of sparse linear models, in which each component function lies in an ellipsoid or, more formally, a reproducing kernel Hilbert space $\mathcal{H}$. Hence we weaken the assumption that the response depends on the covariates via a linear function. We develop a new $\ell_1$-based polynomial-time method and prove that it has desirable mean-squared error performance, even when $p \gg n$. Furthermore, we prove lower bounds on the mean-squared error for estimating sparse additive models that match the upper bounds achieved by our method. Hence our algorithm is optimal in terms of the mean-squared error rate.
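For concreteness, a sparse additive model of this kind is typically written (with notation introduced here for illustration) as
$$y_i = \sum_{j=1}^{p} f_j(x_{ij}) + \epsilon_i, \qquad i = 1, \dots, n,$$
where each univariate component function $f_j$ lies in a reproducing kernel Hilbert space $\mathcal{H}$, only a small number of the $f_j$ are non-zero, and $\epsilon_i$ denotes additive noise; the sparse linear model is recovered when each $f_j$ is linear.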
