Due to the growth of modern dataset sizes and the desire to harness the computing power of multiple machines, there has been a recent surge of interest in the design of distributed machine learning algorithms. Algorithms in this area are typically classified by the order of information they use. First order methods use gradient information to update the model in each iteration. Second order methods employ second order information to harness the computing power of each worker and reduce the cost of communication. Zeroth order methods use estimated gradient information to solve problems where the gradient is not available. However, since distributed algorithms require communication between workers and the server, they are vulnerable to Byzantine attackers, who can send falsified data to prevent the algorithms from converging or to steer them toward values of the attackers' choice. Some recent work has proposed first order algorithms that remain robust when up to half of the workers are compromised.
In this thesis, we investigate first, second, and zeroth order distributed algorithms that can deal with Byzantine attackers.
Firstly, we discuss robust first order algorithms for distributed networks. A commonly used algorithm is distributed gradient descent, but most existing algorithms assume that there are no attackers in the network, while in practice there is a risk that some workers are compromised. We provide a novel first order algorithm that can deal with an arbitrary number of Byzantine attackers. The main idea is to ask the parameter server to randomly select a small clean dataset and compute a noisy gradient using it. This noisy gradient is then used as an approximate ground truth to filter out information sent by compromised workers. We show that the proposed algorithm converges to a neighborhood of the population minimizer regardless of the number of Byzantine attackers. We also propose a second algorithm for the case where an upper bound on the number of Byzantine attackers is known, and show that it achieves a better convergence rate than the first. We further provide numerical examples showing that the proposed algorithms benefit from the presence of good workers and outperform existing algorithms.
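To make the filtering step concrete, below is a minimal sketch of one server iteration under this idea. The acceptance radius xi, the helper grad_fn, and the fallback rule are illustrative assumptions, not the exact algorithm analyzed in the thesis.

```python
import numpy as np

def filtered_gradient_step(w, worker_grads, clean_data, grad_fn, lr=0.1, xi=1.0):
    """One server iteration: filter worker gradients against a noisy
    gradient computed on a small clean dataset, then take a descent step.

    worker_grads : list of gradient vectors reported by the workers
                   (possibly falsified by Byzantine attackers).
    clean_data   : small dataset trusted by the parameter server.
    grad_fn      : grad_fn(w, data) -> gradient of the loss at w.
    xi           : acceptance radius around the noisy gradient (assumed).
    """
    g0 = grad_fn(w, clean_data)  # noisy "ground truth" gradient
    # Keep only reported gradients that lie close to the noisy gradient.
    accepted = [g for g in worker_grads if np.linalg.norm(g - g0) <= xi]
    # If every worker is rejected, fall back to the clean-data gradient,
    # so the step never depends on the number of attackers.
    g = np.mean(accepted, axis=0) if accepted else g0
    return w - lr * g
```

Because the fallback is always available, good workers can only sharpen the gradient estimate; they are never required for convergence.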
Secondly, we discuss robust second order distributed algorithms that can deal with Byzantine attackers. We propose two such algorithms. The main idea of the first, named the median-based approximate Newton's method (MNM), is to ask the parameter server to aggregate the gradient and approximate Newton's direction information from all workers by their geometric medians. We show that MNM converges when up to half of the workers are Byzantine attackers. To handle an arbitrary number of attackers, we then propose a comparison-based approximate Newton's method (CNM). The main idea of CNM is to ask the server to randomly select a small clean dataset and compute a noisy gradient and Newton's direction using it. This noisy information is then used as an approximation of the ground truth to filter out bad information from Byzantine attackers. We show that CNM converges to a neighborhood of the population minimizer even when more than half of the workers are Byzantine. We further provide numerical examples to illustrate the performance of the proposed algorithms.
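As a rough illustration of the median-based aggregation in MNM, the sketch below computes a geometric median of worker-reported vectors via Weiszfeld's algorithm; the iteration count and tolerance are assumed values.

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-8):
    """Weiszfeld's algorithm for the geometric median of row vectors.
    The geometric median is a robust aggregate: it is not driven off
    course as long as fewer than half of the points are adversarial."""
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)  # initialize at the coordinate-wise mean
    for _ in range(iters):
        d = np.linalg.norm(pts - y, axis=1)
        d = np.maximum(d, tol)  # avoid division by zero at a data point
        w = 1.0 / d
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

# The server would apply the same routine to both kinds of reports, e.g.:
# g = geometric_median(worker_gradients)
# d = geometric_median(worker_newton_directions)
```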
Finally, we discuss a robust zeroth order algorithm for decentralized distributed networks. We propose a zeroth order adversarial robust alternating direction method of multipliers (ZOAR-ADMM) that can deal with Byzantine attackers in a consensus network. The main idea of the algorithm is to ask each worker to store, for every neighbor, a local deviation statistic of the distance between that neighbor's model parameter and its own. This information is then used to filter out bad model parameters sent by Byzantine attackers. We show that the algorithm converges to the sample minimizer and that the objective function converges to its optimal value. We further provide numerical examples to illustrate the performance of the proposed algorithm.
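The following is a minimal sketch of how such per-neighbor screening might look at a single worker. The running-average update, the smoothing factor beta, and the tolerance multiplier c are illustrative assumptions, not the exact ZOAR-ADMM test.

```python
import numpy as np

class NeighborFilter:
    """Per-worker screening of neighbor updates in a consensus network,
    based on a stored deviation statistic for every neighbor."""

    def __init__(self, neighbor_ids, beta=0.9, c=2.0):
        self.stats = {j: None for j in neighbor_ids}  # running deviation per neighbor
        self.beta = beta  # smoothing factor for the running statistic (assumed)
        self.c = c        # tolerance multiplier for accepting an update (assumed)

    def screen(self, w_own, neighbor_params):
        """Return the neighbor parameters whose deviation from our own
        model is consistent with that neighbor's history."""
        accepted = {}
        for j, w_j in neighbor_params.items():
            dev = np.linalg.norm(w_j - w_own)
            s = self.stats[j]
            # Accept on first contact, then compare against the history.
            if s is None or dev <= self.c * max(s, 1e-8):
                accepted[j] = w_j
            # Update the stored statistic with the observed deviation.
            self.stats[j] = dev if s is None else self.beta * s + (1 - self.beta) * dev
        return accepted
```

Keeping the statistic local means no trusted server is needed: each worker screens its neighbors using only quantities it can observe itself.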