Distributed machine learning bridges the traditional fields of distributed systems and machine learning, nurturing a rich family of research problems. Classical machine learning algorithms process data in a single-threaded procedure, but as the scale of datasets and the complexity of models grow rapidly, processing on a single machine becomes prohibitively slow. Distributed computing, in turn, involves several fundamental trade-offs: allocating the data to multiple computing nodes reduces the computation time, but parallelizing the algorithm compromises statistical accuracy and incurs communication cost. These trade-offs place our interest at the intersection of multiple areas, including statistical theory, communication complexity theory, information theory, and optimization theory.
In this thesis, we explore the theoretical foundations of distributed machine learning under communication constraints. We study the trade-off between communication and computation, as well as the trade-off between communication and learning accuracy. In particular settings, we are able to design algorithms that compromise on neither side. We also establish fundamental limits that apply to all distributed algorithms. In more detail, this thesis makes the following contributions:
* We propose communication-efficient algorithms for statistical optimization. These algorithms achieve the best possible statistical accuracy while incurring the least possible computational overhead (a toy sketch of this distributed estimation setting is given after the list).
* We extend the same algorithmic idea to non-parametric regression, proposing an algorithm that also guarantees the optimal statistical rate while reducing the computation time superlinearly.
* In the general setting of regularized empirical risk minimization, we propose a distributed optimization algorithm whose communication cost is independent of the data size and depends only weakly on the number of machines.
* We establish lower bounds on the communication complexity of statistical estimation and linear algebraic operations. These lower bounds characterize the fundamental limits of any distributed algorithm.
* We design and implement a general framework for parallelizing sequential algorithms. The framework consists of a programming interface and an execution engine. The programming interface allows machine learning experts to implement an algorithm without handling any details of the distributed system, while the execution engine automatically parallelizes the algorithm in a communication-efficient manner; a hypothetical sketch of this interface/engine separation is given below.
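To make the distributed estimation setting of the first two contributions concrete, the following is a minimal sketch of the simplest communication-efficient baseline: each machine computes a local estimate on its own shard of the data, and a center averages the local estimates in a single communication round. Least squares is used purely for illustration, and the even i.i.d. split of the data is an assumption of the sketch; this is a naive baseline, not the specific algorithms proposed in the thesis.

```python
import numpy as np

def local_ols(X, y):
    # Ordinary least squares computed on one machine's local shard.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def averaged_estimator(shards):
    # One round of communication: each machine sends its local estimate
    # (a d-dimensional vector) to the center, which averages them.
    local_estimates = [local_ols(X, y) for X, y in shards]
    return np.mean(local_estimates, axis=0)

# Toy usage: n samples split evenly across m machines.
rng = np.random.default_rng(0)
n, d, m = 10000, 5, 10
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)
shards = list(zip(np.array_split(X, m), np.array_split(y, m)))
print(averaged_estimator(shards))
```

Each machine communicates only a d-dimensional vector, so the communication cost of this baseline is independent of the local sample size.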
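The separation between programming interface and execution engine in the last contribution can likewise be illustrated with a hypothetical sketch. The names below (`SequentialAlgorithm`, `process`, `ParallelEngine`) are illustrative stand-ins rather than the framework's actual API: the user writes only a sequential, per-sample update rule, and the engine decides how to shard the data, run the rule on each shard, and combine the resulting states.

```python
from abc import ABC, abstractmethod
import numpy as np

class SequentialAlgorithm(ABC):
    # User-facing interface: a purely sequential update rule. Nothing here
    # refers to machines, threads, or communication.
    @abstractmethod
    def initial_state(self, dim):
        ...

    @abstractmethod
    def process(self, state, x, y):
        # Consume one (x, y) sample and return the updated state.
        ...

class SGDLeastSquares(SequentialAlgorithm):
    # Example user algorithm: stochastic gradient descent for least squares.
    def __init__(self, lr=0.01):
        self.lr = lr

    def initial_state(self, dim):
        return np.zeros(dim)

    def process(self, w, x, y):
        return w - self.lr * (x @ w - y) * x

class ParallelEngine:
    # Execution engine (sketch): shards the data, runs the sequential rule
    # on each shard, and averages the resulting states, which is one of the
    # simplest communication-efficient combination strategies.
    def __init__(self, num_workers):
        self.num_workers = num_workers

    def run(self, algorithm, X, y):
        dim = X.shape[1]
        shards = zip(np.array_split(X, self.num_workers),
                     np.array_split(y, self.num_workers))
        states = []
        for X_s, y_s in shards:  # each iteration stands in for one worker
            state = algorithm.initial_state(dim)
            for x_i, y_i in zip(X_s, y_s):
                state = algorithm.process(state, x_i, y_i)
            states.append(state)
        return np.mean(states, axis=0)  # a single round of communication

# Toy usage: the user never touches sharding or communication.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=2000)
print(ParallelEngine(num_workers=8).run(SGDLeastSquares(lr=0.01), X, y))
```

The point of the sketch is the division of responsibilities: the communication pattern and combination strategy are determined entirely by the engine, while the user code stays identical to its single-machine version.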