Toward Understanding the Dynamics of Over-parameterized Neural Networks

Abstract

The practical applications of neural networks are vast and varied, yet a comprehensive understanding of their underlying principles remains incomplete. This dissertation advances the theoretical understanding of neural networks, with a particular focus on over-parameterized models. It investigates their optimization and generalization dynamics and sheds light on various deep-learning phenomena observed in practice. This research deepens the understanding of the complex dynamics of these models and establishes theoretical insights that closely align with their empirical behavior across diverse computational tasks.

In the first part of the thesis, we analyze fundamental properties of over-parameterized neural networks and demonstrate that these properties can account for the success of their optimization. We show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo a transition to linearity: the networks converge to their first-order Taylor expansion in the parameters as their ``width'' approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, excluding those in the input and first layers. We further demonstrate that the transition to linearity plays an important role in the success of optimizing over-parameterized neural networks.
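Concretely, writing $f(\mathbf{w}; x)$ for the network output at parameters $\mathbf{w}$ and input $x$, and $\mathbf{w}_0$ for the parameters at initialization (notation introduced here only for illustration), transition to linearity means that in the infinite-width limit the network is well approximated, in a neighborhood of $\mathbf{w}_0$, by its linearization
\[
f(\mathbf{w}; x) \;\approx\; f_{\mathrm{lin}}(\mathbf{w}; x) \;:=\; f(\mathbf{w}_0; x) \;+\; \nabla_{\mathbf{w}} f(\mathbf{w}_0; x)^{\top} (\mathbf{w} - \mathbf{w}_0),
\]
which is linear in the parameters $\mathbf{w}$, though generally not in the input $x$.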

In the second part of the thesis, we investigate the modern training regime of over-parameterized neural networks, focusing in particular on the large learning rate regime. While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. We show that recently proposed Neural Quadratic Models can exhibit the ``catapult phase''~\cite{lewkowycz2020large} that arises when training such models with large learning rates. We then empirically show that the behavior of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analyzing neural networks.
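For concreteness, a quadratic model of this kind can be taken to be the second-order Taylor expansion of the network in its parameters (again in illustrative notation, with $\Delta\mathbf{w} := \mathbf{w} - \mathbf{w}_0$):
\[
f_{\mathrm{quad}}(\mathbf{w}; x) \;:=\; f(\mathbf{w}_0; x) \;+\; \nabla_{\mathbf{w}} f(\mathbf{w}_0; x)^{\top} \Delta\mathbf{w} \;+\; \tfrac{1}{2}\, \Delta\mathbf{w}^{\top}\, \nabla^{2}_{\mathbf{w}} f(\mathbf{w}_0; x)\, \Delta\mathbf{w}.
\]
The quadratic term, absent from the linearized model above, is what allows such models to exhibit the catapult phase that linear models cannot.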

Moreover, we extend the analysis of catapult dynamics to stochastic gradient descent (SGD). We first explain the common occurrence of spikes in the training loss when neural networks are trained with SGD, providing evidence that these spikes are caused by catapults. Second, we posit an explanation for how catapults lead to better generalization: catapults increase feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance. Overall, by integrating theoretical insights with empirical validation, this dissertation provides a new understanding of the complex dynamics governing neural network training and generalization.
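For reference, the Average Gradient Outer Product of the true predictor $f^{*}$ over the input distribution $\mathcal{D}$ can be written (in illustrative notation) as
\[
\mathrm{AGOP}(f^{*}) \;:=\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \nabla_{x} f^{*}(x)\, \nabla_{x} f^{*}(x)^{\top} \right],
\]
and the alignment referred to above compares this matrix with the analogous quantity computed from the trained network, for instance via a cosine similarity between the two matrices.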
