eScholarship
Open Access Publications from the University of California

Divide and Conquer Algorithms for Faster Machine Learning

  • Author(s): Izbicki, Michael
  • Advisor(s): Shelton, Christian R
Creative Commons Attribution-ShareAlike 4.0 International Public License
Abstract

This thesis improves the scalability of machine learning by studying mergeable learning algorithms. In a mergeable algorithm, many processors independently solve the learning problem on small subsets of the data. Then a master processor merges the solutions together with only a single round of communication. Mergeable algorithms are popular because they are fast, easy to implement, and have strong privacy guarantees.
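The merge pattern described above can be made concrete with a minimal sketch (not from the thesis): Gaussian density estimation, where each processor summarizes its subset by sufficient statistics and the master combines the summaries in one round. The function names `fit` and `merge` are hypothetical.

```python
import numpy as np

def fit(data):
    # Each processor reduces its subset to sufficient statistics:
    # (count, sum, sum of squares). No raw data leaves the processor,
    # which is the source of the privacy guarantees mentioned above.
    return len(data), data.sum(), (data ** 2).sum()

def merge(stats):
    # A master processor combines all summaries in a single round
    # of communication.
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    sq = sum(s[2] for s in stats)
    mean = total / n
    var = sq / n - mean ** 2
    return mean, var

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)

# Split across 8 "processors", fit independently, then merge.
parts = np.array_split(data, 8)
mean, var = merge([fit(p) for p in parts])
```

For this estimator the merged result exactly equals the estimate computed on the full dataset; for most of the algorithms studied in the thesis the merge is only approximate, and bounding that approximation error is the hard part.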

Our first contribution is a novel fast cross validation procedure suitable for any mergeable algorithm. This fast cross validation procedure has a constant runtime independent of the number of folds and can be implemented on distributed systems. This procedure is also widely applicable. We show that 32 recently proposed learning algorithms are mergeable and therefore fit our cross validation framework. These learning algorithms come from many subfields of machine learning, including density estimation, regularized loss minimization, dimensionality reduction, submodular optimization, variational inference, and Markov chain Monte Carlo.
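One plausible reading of why the runtime is independent of the number of folds: each data point is trained on exactly once, and the per-fold models are then produced by merging rather than retraining. The sketch below illustrates this with a trivial mean estimator standing in for a generic mergeable algorithm; all names (`fit`, `merge`, `fast_cv`) are hypothetical, not the thesis's API.

```python
from functools import reduce
import numpy as np

def fit(x):
    # Sufficient statistics for a mean estimator: (count, sum).
    return len(x), x.sum()

def merge(a, b):
    return a[0] + b[0], a[1] + b[1]

def predict(stats):
    n, s = stats
    return s / n

def fast_cv(x, k):
    parts = np.array_split(x, k)
    # Training touches each data point exactly once, regardless of k.
    models = [fit(p) for p in parts]
    errors = []
    for i, p in enumerate(parts):
        # The model for fold i is the merge of the other k-1 models,
        # obtained without any retraining.
        held_out = reduce(merge, (m for j, m in enumerate(models) if j != i))
        errors.append(np.mean((p - predict(held_out)) ** 2))
    return np.mean(errors)

x = np.random.default_rng(0).normal(size=1000)
err = fast_cv(x, k=10)
```

For this exactly-mergeable estimator the result matches ordinary k-fold cross validation; in general the quality of the fast procedure depends on the quality of the merge.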

We also provide two new mergeable learning algorithms. In the context of regularized loss minimization, existing merge procedures either have high bias or slow runtimes. We introduce the optimal weighted average (OWA) merge procedure, which achieves both a low bias and a fast runtime. We also improve the cover tree data structure for fast nearest neighbor queries by providing a merge procedure. In doing so, we improve both the theoretical guarantees of the cover tree and its practical runtime. For example, the original cover tree was able to find nearest neighbors in time $O(c_{\mathrm{exp}}^{12}\log n)$, and we improve this bound to $O(c_{\mathrm{hole}}^4\log n)$ for i.i.d. data. Here, $c_{\mathrm{exp}}$ and $c_{\mathrm{hole}}$ are measures of the "intrinsic dimensionality" of the data, and on typical datasets $c_{\mathrm{exp}} > c_{\mathrm{hole}}$. Experiments on large-scale ad-click, genomics, and image classification tasks empirically validate these algorithms.
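The abstract does not spell out OWA, but the general idea can be sketched against the naive baseline it improves on: instead of uniformly averaging the locally trained models, learn the mixing weights by re-solving the problem in the low-dimensional subspace spanned by those models. The sketch below uses ridge regression as a stand-in for a generic regularized loss, a closed-form solver, and a small second-stage sample; this is our illustrative reading, not the thesis's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 10, 8000, 8
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam=0.1):
    # Closed-form ridge regression: (X^T X + lam I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Each of m processors fits a local model on its data shard.
Xs, ys = np.array_split(X, m), np.array_split(y, m)
W = np.stack([ridge(Xi, yi) for Xi, yi in zip(Xs, ys)], axis=1)  # d x m

# Naive merge: uniform average of the local models.
w_avg = W.mean(axis=1)

# OWA-style merge (sketch): learn mixing weights a by solving the same
# regression over the m-dimensional subspace spanned by the columns of W,
# using a small sample held by the master.
Z = X[:500] @ W              # project features through the local models
a = ridge(Z, y[:500])
w_owa = W @ a                # merged model lies in span(W)
```

The second stage is cheap because it is an m-dimensional problem on a small sample, so the total cost stays close to one round of local training plus one round of communication.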
