Statistical Learning with Neural Networks Trained by Gradient Descent
- Author(s): Frei, Spencer
- Advisor(s): Wu, Ying Nian; Gu, Quanquan
In this thesis, we theoretically analyze the ability of neural networks trained by gradient descent to learn. The learning problem consists of an algorithmic component and a statistical component. The algorithmic question concerns the underlying optimization problem: given samples from a distribution, under what conditions can a neural network trained by gradient descent efficiently minimize the empirical risk for some loss function defined over these samples? As the underlying optimization problem is highly non-convex, standard tools from optimization theory are not applicable and thus a novel analysis is needed. The statistical question concerns the generalization problem: supposing gradient descent is successful at minimizing the empirical risk, under what conditions does this translate to a guarantee for the population risk? Contemporary neural networks used in practice are highly overparameterized and are capable of minimizing the empirical risk even when the true labels are replaced with random noise, and thus standard uniform convergence-based arguments will fail to yield meaningful guarantees for the population risk for these models.
We begin our thesis by analyzing the simplest nontrivial neural network: a single neuron with a nonlinear activation function, trained under the squared loss. Even this simple network induces a highly non-convex optimization problem. By showing that a surrogate risk is approximately minimized along the gradient descent trajectory, we show that gradient descent learns single neurons for a large class of nonlinear activation functions. Our results hold in the agnostic setting, implying that gradient descent succeeds even when the model is misspecified.
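As a concrete illustration of this setting, the sketch below trains a single leaky-ReLU neuron by gradient descent on the empirical squared loss over synthetic planted-teacher data. The activation, dimensions, step size, and data distribution are illustrative choices, not the precise conditions of the theory (which in particular also covers the agnostic, misspecified case where no teacher neuron exists).

```python
import numpy as np

rng = np.random.default_rng(0)

# Leaky ReLU: one illustrative member of the class of nonlinear
# activations considered in the single-neuron setting.
def act(z):
    return np.where(z > 0, z, 0.1 * z)

def act_grad(z):
    return np.where(z > 0, 1.0, 0.1)

# Planted-teacher data: labels generated by a "teacher" neuron w_star
# (a well-specified instance for simplicity).
n, d = 500, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = act(X @ w_star)

# Gradient descent on the (non-convex) empirical squared loss.
w = 0.1 * rng.standard_normal(d)
init_loss = 0.5 * np.mean((act(X @ w) - y) ** 2)
lr = 0.05
for _ in range(500):
    z = X @ w
    grad = (X * ((act(z) - y) * act_grad(z))[:, None]).mean(axis=0)
    w -= lr * grad

final_loss = 0.5 * np.mean((act(X @ w) - y) ** 2)
```

Despite the non-convexity, the empirical risk decreases substantially along the trajectory in this toy instance.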
We continue our analysis of the single neuron by examining the classification setting, where the loss of interest is the zero-one loss rather than the squared loss. Since, for typical activation functions, the decision boundary of a single neuron is identical to that of a linear classifier, we focus on linear classifiers. This reduces the problem to learning halfspaces with noise, a long-studied problem in computational learning theory with well-established computational hardness results stemming from the non-convexity of the zero-one loss. We establish connections between minimizers of convex surrogates of the zero-one loss and minimizers of the zero-one loss itself, yielding the first positive guarantees for gradient descent on convex loss functions for learning halfspaces with agnostic noise.
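The convex-surrogate idea can be sketched as follows: run gradient descent on the logistic loss (a convex surrogate) over halfspace data with randomly flipped labels, a simple stand-in for agnostic noise, and then evaluate the resulting classifier against the zero-one loss. All constants here are illustrative, not the quantitative conditions of the theorems.

```python
import numpy as np

rng = np.random.default_rng(1)

# Halfspace data with a fraction of labels flipped uniformly at random
# (a simple stand-in for the agnostic noise model).
n, d, noise_rate = 1000, 5, 0.1
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star)
y[rng.random(n) < noise_rate] *= -1

# Gradient descent on the logistic loss, a convex surrogate for the
# non-convex zero-one loss.
w = np.zeros(d)
lr = 0.5
for _ in range(300):
    margins = np.clip(y * (X @ w), -50, 50)  # clip for numerical stability
    sigma = 1.0 / (1.0 + np.exp(margins))    # derivative factor of log(1+e^{-m})
    w += lr * (X * (y * sigma)[:, None]).mean(axis=0)

# Evaluate the surrogate minimizer against the loss we actually care about.
zero_one_error = np.mean(np.sign(X @ w) != y)
```

In this toy instance the zero-one error of the surrogate minimizer lands near the label-flip rate, illustrating the kind of connection the chapter formalizes.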
We then establish guarantees for learning halfspaces with agnostic noise using overparameterized, SGD-trained two-layer nonlinear neural networks. Our analysis requires both overcoming the non-convexity of the underlying optimization problem and avoiding generalization bounds that become vacuous as the number of parameters in the network grows large.
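A minimal sketch of this training setup, under illustrative choices: a two-layer ReLU network with a frozen random-sign second layer (a common simplification in two-layer analyses), trained by online SGD on the logistic loss over noisy halfspace data. The width, step size, and noise rate are arbitrary stand-ins, not the parameters required by the theory.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy halfspace data, an illustrative stand-in for the distributions
# covered by the analysis.
n, d, m = 1000, 5, 200          # m: hidden width, taken large
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star)
y[rng.random(n) < 0.05] *= -1

# Two-layer ReLU network f(x) = a . relu(W x); the second layer a is
# frozen at random signs and only W is trained by online SGD.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
lr = 0.1
for _ in range(10000):
    i = rng.integers(n)
    x, yi = X[i], y[i]
    h = W @ x                                  # pre-activations, shape (m,)
    f = a @ np.maximum(h, 0.0)                 # network output
    # SGD step on the logistic loss log(1 + exp(-y f)), w.r.t. W only.
    coef = -yi / (1.0 + np.exp(np.clip(yi * f, -50, 50)))
    W -= lr * np.outer(coef * a * (h > 0), x)

preds = np.sign(np.maximum(X @ W.T, 0.0) @ a)
train_error = np.mean(preds != y)
```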
In our final contribution, we derive generalization bounds for overparameterized deep residual networks trained by gradient descent. Our techniques leverage a recently developed correspondence between large, overparameterized neural networks and their infinite-width approximations, whose training dynamics are governed by the so-called neural tangent kernel.
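The correspondence can be illustrated on the simplest case, a two-layer ReLU network rather than a residual network: the empirical tangent kernel at random initialization, computed from finite-width Jacobians, concentrates around a fixed infinite-width kernel as the width grows. The network form and widths below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical tangent kernel of f(x) = (1/sqrt(m)) * sum_j a_j relu(w_j . x),
# differentiating in the first-layer weights only.
d, m = 4, 10_000
W = rng.standard_normal((m, d))          # rows w_j ~ N(0, I)
a = rng.choice([-1.0, 1.0], size=m)

def ntk_entry(x1, x2):
    # grad_{w_j} f(x) = (1/sqrt(m)) * a_j * 1{w_j . x > 0} * x, so the
    # tangent kernel is the inner product of the two Jacobians.
    act1 = (W @ x1 > 0).astype(float)
    act2 = (W @ x2 > 0).astype(float)
    return (x1 @ x2) * np.mean(a * act1 * a * act2)

# As the width m grows, this concentrates around the infinite-width limit
# (x1 . x2) * (pi - theta) / (2 pi), where theta is the angle between
# x1 and x2 (an arc-cosine kernel).
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
theta = np.arccos(x1 @ x2)
limit = (x1 @ x2) * (np.pi - theta) / (2.0 * np.pi)
empirical = ntk_entry(x1, x2)
```

At this width the empirical entry agrees with the limiting kernel to within a few percent; it is this kind of concentration that lets kernel-based generalization arguments transfer to the finite-width, gradient-descent-trained network.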