Qsparse-local-SGD: Communication Efficient Distributed SGD with Quantization, Sparsification, and Local Computations
Large scale distributed optimization has become increasingly important with the emergence of edge computation architectures such as in the federated learning setup, where large amounts of data, possibly of a secure nature and generated in an online manner can be massively distributed across personal devices. A key bottleneck for many such large-scale problems is in the communication overhead of exchanging information between devices over bandwidth limited networks as well as in the unreliability of communication for distributed optimization. The existing approaches propose to mitigate these bottlenecks either by using different forms of compression or by computing local models and mixing them iteratively. In this thesis we first propose a novel class of highly communication efficient operators that employ stochastic and deterministic quantization with aggressive sparsification such as Top-k in the form of a composed operator. Furthermore, in federated learning one can use local computations to reduce communication. Using such a framework, we incorporate local iterations into our algorithm which allows the communication to be infrequent and possibly asynchronous thereby enabling significantly reduced communication.
Putting them together we have distributed Qsparse-local-SGD for federated learning for which our analysis demonstrates convergence rates matching vanilla distributed SGD where we observe that quantization and sparsification are almost for free for smooth functions, both non-convex and convex. We characterize the asymptotic allowable limits of local iterations for synchronous and asynchronous implementations of Qsparse-local-SGD, so as to harness both the distributed processing gains as well as the benefits of quantization, sparsification and local computations. Our numerics demonstrate that Qsparse-local-SGD combines the bit savings of our composed operators, as well as local computations, thereby outperforming the cases where these techniques are individually used. We use it to train ResNet-50 on ImageNet, as well as a softmax multi-class classifier on MNIST, resulting in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.