Deep learning has witnessed rapid growth and wide application in recent years. One of the essential properties of modern deep learning models is that they are sufficiently over-parameterized, i.e., they have many more learnable parameters than training examples. Over-parameterization, on the one hand, underlies the superior approximation/representation ability of neural network functions; on the other hand, according to conventional wisdom in learning theory, it could lead to severe over-fitting. This wisdom, however, is not consistent with the empirical success of deep learning, where neural network models trained by standard optimization algorithms (e.g., stochastic gradient descent (SGD), Adam, etc.) not only perfectly fit the training data (i.e., find a global solution of the training objective) but also generalize well on test data. This dissertation seeks to understand and explain this phenomenon by carefully characterizing the role of optimization algorithms in learning over-parameterized models.
We begin the dissertation by studying arguably the simplest setting: over-parameterized linear regression. In particular, we consider a class of SGD algorithms and prove problem-dependent generalization error bounds for them. Based on the established generalization guarantees, we further characterize sufficient conditions on the least squares problem itself (e.g., conditions on the data distribution and the ground-truth model parameters) under which SGD generalizes. Moreover, motivated by recent work on the implicit regularization of SGD, we also provide a complete comparison between the SGD solution and the solution of regularized least squares (i.e., ridge regression). We demonstrate the benefit of SGD over ridge regression for a large class of least squares problems, which partially explains the implicit regularization effect of SGD.
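As a point of reference for this comparison, the following display sketches the two estimators in generic notation (the step sizes $\gamma_t$, the ridge parameter $\lambda$, and the one-pass sampling scheme are illustrative assumptions and need not match the exact variants analyzed in the dissertation). Given data $(\mathbf{x}_t, y_t) \in \mathbb{R}^d \times \mathbb{R}$ with $d$ possibly much larger than the sample size $n$, SGD on the least squares loss iterates
\[
\mathbf{w}_{t+1} = \mathbf{w}_t + \gamma_t \big( y_t - \langle \mathbf{w}_t, \mathbf{x}_t \rangle \big) \mathbf{x}_t ,
\]
whereas ridge regression returns the explicitly regularized solution
\[
\widehat{\mathbf{w}}_{\lambda} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle \big)^2 + \lambda \| \mathbf{w} \|_2^2 .
\]
The comparison in this part concerns how the implicit regularization induced by the SGD iterates relates to the explicit $\ell_2$ regularization above.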
In the second part, we study the effect of optimization algorithms in learning over-parameterized neural network models. Unlike linear models, for which optimization guarantees can be easily established, studying optimization in training deep neural networks is challenging because the training objective is nonconvex and even nonsmooth. Therefore, we first study the optimization of over-parameterized neural network models and establish global convergence guarantees for both gradient descent (GD) and SGD under mild conditions on the data distribution. Based on the optimization analysis, we further develop an algorithm-dependent generalization analysis for GD/SGD. We show that if the data distribution admits certain good separation properties, a deep ReLU network with polylogarithmic width can be successfully trained with a global convergence guarantee and good generalization ability. Finally, we compare the generalization ability of two optimization algorithms, GD and Adam, in learning over-parameterized neural networks, and show that they exhibit different algorithmic biases, which consequently lead to solutions with different generalization performance.
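To make the contrast between the two algorithmic biases concrete, recall the standard textbook forms of the two updates for an empirical loss $L(\mathbf{w})$; the precise variants analyzed in the dissertation may differ (e.g., in momentum or bias-correction terms). GD takes steps along the raw gradient,
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t),
\]
while Adam maintains exponential moving averages of the gradient and its coordinate-wise square and rescales each coordinate accordingly (all operations on $\mathbf{m}_t$ and $\mathbf{v}_t$ below are coordinate-wise),
\[
\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla L(\mathbf{w}_t), \quad
\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \big(\nabla L(\mathbf{w}_t)\big)^{2}, \quad
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \frac{\mathbf{m}_t}{\sqrt{\mathbf{v}_t} + \epsilon}.
\]
Roughly speaking, the coordinate-wise rescaling drives Adam's updates toward sign-gradient directions, which gives one intuition for why its implicit bias, and hence the solution it selects, can differ from that of GD.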
The works covered in this dissertation form an exploration of the role of optimization algorithms in learning over-parameterized models; they constitute an incomplete collection of recent advances in deep learning theory but shed light on a broad class of future directions in deep learning research.