Randomized Algorithms for Scalable Machine Learning
Many existing procedures in machine learning and statistics are computationally intractable in the setting of large-scale data. As a result, the advent of rapidly increasing dataset sizes, which should be a boon yielding improved statistical performance, instead severely blunts the usefulness of a variety of existing inferential methods. In this work, we use randomness to ameliorate this lack of scalability by reducing complex, computationally difficult inferential problems to larger sets of significantly smaller and more tractable subproblems. This approach allows us to devise algorithms which are both more efficient and more amenable to use of parallel and distributed computation. We propose novel randomized algorithms for two broad classes of problems that arise in machine learning and statistics: estimator quality assessment and semidefinite programming. For the former, we present the Bag of Little Bootstraps (BLB), a procedure which incorporates features of both the bootstrap and subsampling to obtain substantial computational gains while retaining the bootstrap's accuracy and automation; we also present a novel diagnostic procedure which leverages increasing dataset sizes combined with increasingly powerful computational resources to render existing estimator quality assessment methodology more automatically usable. For semidefinite programming, we present Random Conic Pursuit, a procedure that solves semidefinite programs via repeated optimization over randomly selected two-dimensional subcones of the positive semidefinite cone. As we demonstrate via both theoretical and empirical analyses, these algorithms are scalable, readily benefit from the use of parallel and distributed computing resources, are generically applicable and easily implemented, and have favorable theoretical properties.