The Sparse Principal Component Analysis (Sparse PCA) problem is a variant of the classical PCA problem. The goal of Sparse PCA is to achieve a trade-off between the explained variance along a normalized vector and the number of non-zero components of that vector. Sparse PCA has a wide array of applications in machine learning and engineering. Unfortunately, the problem is combinatorially hard, and various sub-optimal algorithms and approximate formulations have been proposed to tackle it. In this dissertation, we first discuss convex relaxation techniques that efficiently produce good approximate solutions. We then describe several algorithms for solving these relaxations, as well as greedy algorithms that iteratively improve solution quality.
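To fix ideas, one common penalized formulation of this trade-off is sketched below; the notation is a standard choice for illustration and is not necessarily the one adopted in the body of the dissertation:
\[
\max_{\|x\|_2 = 1} \; x^\top \Sigma x \;-\; \rho \,\|x\|_0,
\]
where $\Sigma$ is the sample covariance matrix, $\rho \ge 0$ is a penalty controlling sparsity, and $\|x\|_0$ counts the non-zero entries of $x$. Setting $\rho = 0$ recovers classical PCA, while larger values of $\rho$ drive more components of $x$ to zero.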
The dissertation then focuses on solving an attractive formulation called DSPCA (a Direct formulation for Sparse PCA) for large-scale problems. Although Sparse PCA has apparent advantages over PCA, such as better interpretability, it is generally thought to be computationally much more expensive. We demonstrate the surprising fact that Sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This stems from a rigorous feature-elimination pre-processing result, coupled with the favorable fact that the variances of features in real-life data typically decay rapidly, which allows many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than existing first-order methods. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features.
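The snippet below is a minimal sketch of the feature-elimination pre-processing just described: features whose variance falls below a sparsity penalty rho are dropped before the main solver runs. The function name, the exact threshold rule, and the synthetic data are illustrative assumptions; the precise elimination criterion is developed in the dissertation itself.

```python
import numpy as np

def eliminate_features(X, rho):
    """Drop features whose variance falls below rho.

    Under the safe-elimination idea sketched in the text, such
    low-variance features cannot enter the sparse solution, so they
    can be removed before solving the (much smaller) problem.

    X   : (n_samples, n_features) data matrix with centered columns
    rho : sparsity penalty used as the threshold (assumed rule)
    """
    variances = X.var(axis=0)              # per-feature variance
    keep = np.where(variances > rho)[0]    # surviving feature indices
    return X[:, keep], keep

# Usage: synthetic data with rapidly decreasing feature variances,
# mimicking the decay typically seen in real-life data sets.
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 1001)          # std of each of 1000 features
X = rng.standard_normal((500, 1000)) * scales
X -= X.mean(axis=0)                        # center the columns
X_small, kept = eliminate_features(X, rho=1e-4)
print(f"kept {X_small.shape[1]} of 1000 features")
```

With variances decaying like $1/j^2$, all but roughly the first hundred features fall below the threshold, illustrating how rapid variance decay lets the pre-processing shrink the problem dramatically.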
Another focus of the dissertation is to illustrate the utility of Sparse PCA in applications ranging from Senate voting records and finance to text mining. In particular, we apply Sparse PCA to the analysis of text data, with online news as our focus. Our experimental results on various data sets illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative to topic models.