This dissertation focuses on stochastic gradient learning for problems involving large datasets or large feature spaces. One of the main advantages of the stochastic gradient technique is its simplicity and low computational complexity, especially for large datasets. This is because the algorithm selects, at each iteration, one data point at random from the training set, updates the weight iterate using this data point, and repeats the procedure until sufficient convergence is attained. Two popular mechanisms for sampling the training data are sampling with and without replacement. In the first case, some samples may be repeated during the learning process, while in the second case the training data is first reshuffled randomly and the algorithm is run over the reshuffled data one sample at a time.
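To make the two mechanisms concrete, the following minimal sketch contrasts them within a basic constant step-size stochastic gradient loop. The gradient oracle `grad`, the step size `mu`, and the least-squares example are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def sgd_with_replacement(grad, w, X, y, mu, epochs):
    """Uniform sampling with replacement: each iteration draws one index
    uniformly at random, so some samples may repeat within an epoch."""
    N = X.shape[0]
    for _ in range(epochs * N):
        i = np.random.randint(N)              # independent uniform draw
        w = w - mu * grad(w, X[i], y[i])      # stochastic gradient update
    return w

def sgd_random_reshuffling(grad, w, X, y, mu, epochs):
    """Sampling without replacement: each epoch reshuffles the data once
    and then visits every sample exactly once in the shuffled order."""
    N = X.shape[0]
    for _ in range(epochs):
        for i in np.random.permutation(N):    # fresh permutation per epoch
            w = w - mu * grad(w, X[i], y[i])
    return w

# Illustrative least-squares loss Q(w; x, y) = (1/2)(x^T w - y)^2
grad_ls = lambda w, x, y: (x @ w - y) * x
```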
It has been observed experimentally in the literature that learning under random reshuffling leads to enhanced performance, not only for traditional stochastic gradient algorithms but also for variance-reduced implementations. However, no theoretical analysis existed to adequately explain this phenomenon under constant step-size adaptation. The first part of this dissertation resolves this issue and establishes analytically that, under random reshuffling, convergence is guaranteed to a small neighborhood of the minimizer at a linear rate. The analysis further shows that random reshuffling outperforms uniform sampling: with a constant step size $\mu$, the iterates approach a neighborhood of size $O(\mu^2)$ around the minimizer, as opposed to the $O(\mu)$ neighborhood attained under uniform sampling. An analytical expression for the steady-state mean-square error under random reshuffling is derived, which clarifies in greater detail the differences between sampling with and without replacement. The dissertation also provides an explanation for the periodic behavior that is observed in random reshuffling implementations.
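Stated informally, and deferring the precise constants and assumptions to the body of the dissertation, the contrast between the two sampling schemes can be summarized as
$$\limsup_{i\rightarrow\infty}\, \mathbb{E}\,\|w^\star - \boldsymbol{w}_i\|^2 = O(\mu) \quad \text{(uniform sampling)}, \qquad \limsup_{k\rightarrow\infty}\, \mathbb{E}\,\|w^\star - \boldsymbol{w}_0^k\|^2 = O(\mu^2) \quad \text{(random reshuffling)},$$
where $w^\star$ denotes the minimizer of the empirical risk, $\boldsymbol{w}_i$ the $i$-th iterate, and $\boldsymbol{w}_0^k$ the iterate at the start of the $k$-th epoch.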
In addition, the dissertation examines the effect of random reshuffling on variance-reduced techniques, which are known to converge to the exact minimizers of empirical risks at linear rates. Existing convergence results assume uniform data sampling with replacement; no proofs or guarantees of convergence existed under random reshuffling. The dissertation establishes linear convergence in the mean-square sense under random reshuffling for the SAGA algorithm, using an argument that is also applicable to other variance-reduced algorithms. A new amortized variance-reduced gradient (AVRG) algorithm is also proposed; in contrast to SAGA, it has constant storage requirements (rather than storing a gradient for every data point), and in contrast to SVRG, its gradient computations are balanced evenly across iterations (rather than requiring periodic full-gradient passes).
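The following is a minimal sketch of the AVRG recursion under random reshuffling, assuming a generic per-sample gradient oracle `grad`; it is meant to illustrate the construction, not to reproduce the dissertation's listing verbatim.

```python
import numpy as np

def avrg(grad, w, X, y, mu, epochs):
    """Amortized variance-reduced gradient (AVRG), sketched informally.
    Only a few extra vectors are stored (unlike SAGA, which keeps a
    gradient per sample), and every epoch costs 2N gradient evaluations
    with no separate full-gradient pass (unlike SVRG), so computation
    is balanced across iterations."""
    N = X.shape[0]
    g = np.zeros_like(w)               # amortized average from previous epoch
    for _ in range(epochs):
        w0 = w.copy()                  # epoch-start iterate acts as snapshot
        g_next = np.zeros_like(w)      # accumulates this epoch's average
        for i in np.random.permutation(N):   # random reshuffling
            gi = grad(w, X[i], y[i])
            g_next += gi / N
            # variance-reduced direction: stochastic gradient corrected by
            # the snapshot gradient and the amortized average g
            w = w - mu * (gi - grad(w0, X[i], y[i]) + g)
        g = g_next                     # correction term for the next epoch
    return w
```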
In the second part of the dissertation, we examine learning under large feature spaces, where the feature information is assumed to be spread across the agents of a network. In this setting, each agent observes only part of the feature space. Through local cooperation, the agents interact with their neighbors to solve the inference problem and converge towards the global minimizer of the empirical risk. The dissertation proposes two solution methods: one operating in the dual domain and the other in the primal domain. The dual-domain solution builds on gradient boosting techniques, with each agent maintaining a local dual variable. By sharing these dual variables with their immediate neighbors through a diffusion learning protocol, the agents are able to match the performance of centralized boosting solutions even though each individual agent only has access to partial information about the feature space. The primal-domain solution, in contrast, combines a dynamic diffusion construction, a pipeline strategy, and variance-reduced techniques. One of its main advantages is that it does not require separate timescales, and convergence towards the exact minimizer occurs at a linear rate.
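As an illustration of the diffusion learning protocol underlying both solutions, the sketch below shows a single adapt-then-combine step over the network; the combination matrix `A`, the oracle `local_grad`, and the adapt-then-combine ordering are illustrative assumptions rather than the dissertation's exact constructions.

```python
import numpy as np

def diffusion_step(local_grad, W, A, mu):
    """One adapt-then-combine diffusion step (illustrative sketch).
    W is a (num_agents, dim) array holding each agent's local variable,
    e.g., a local dual variable; A is a doubly-stochastic combination
    matrix whose entry A[l, k] weights what agent k receives from
    neighbor l (zero for non-neighbors)."""
    num_agents = W.shape[0]
    # Adapt: every agent takes a local gradient step using its own data
    psi = np.stack([W[k] - mu * local_grad(k, W[k])
                    for k in range(num_agents)])
    # Combine: every agent averages its neighbors' intermediate iterates,
    # i.e., w_k = sum_l A[l, k] * psi_l
    return A.T @ psi
```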