Asymptotics of Learning in Neural Networks
Author: Emami, Melikasadat
Advisor(s): Fletcher, Alyson K.
Abstract
Modern machine learning models, particularly those used in deep networks, are characterized by massive numbers of parameters trained on large datasets. While these large-scale learning methods have had tremendous practical success, developing theory that rigorously explains when and why these models work remains an outstanding problem in the field. This dissertation provides a theoretical basis for understanding learning dynamics and generalization in high-dimensional regimes. It brings together two important tools that offer the potential for a rigorous analytic understanding of modern problems: the statistics of high-dimensional random systems and neural tangent kernels. These frameworks enable the precise characterization of complex phenomena in a range of machine learning problems; in particular, they can overcome the non-convexity of the loss function and the non-linearities in the estimation process. The results shed light on the asymptotics of learning in high dimensions for two popular neural network models: Generalized Linear Models (GLMs) and Recurrent Neural Networks (RNNs).
We characterize the generalization error of Generalized Linear Models (GLMs) using a framework called Multi-Layer Vector Approximate Message Passing (ML-VAMP), a recently developed and powerful methodology for the analysis of high-dimensional estimation problems. The framework allows us to analyze the effects of essential design choices, such as the degree of over-parameterization, the loss function, and the regularization, as well as of initialization, feature correlation, and train/test distributional mismatch.
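As a point of reference for the model class (the notation here is illustrative, not taken from the dissertation), a GLM relates a feature vector $\mathbf{x} \in \mathbb{R}^d$ to a response $y$ through a single linear projection followed by a nonlinearity, and is fit by regularized empirical risk minimization:

$$
y = \phi\big(\langle \mathbf{x}, \mathbf{w}^{\ast} \rangle, \xi\big),
\qquad
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} L\big(y_i, \langle \mathbf{x}_i, \mathbf{w} \rangle\big) + \lambda R(\mathbf{w}),
$$

where $\phi$ is an output nonlinearity, $\xi$ is noise, $L$ is the training loss, and $R$ is the regularizer. The asymptotic analysis characterizes the generalization error of $\hat{\mathbf{w}}$ as the sample size $n$ and dimension $d$ grow proportionally.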
Next, we investigate the restrictiveness of a class of Recurrent Neural Networks (RNNs) with unitary weight matrices. Training RNNs suffers from the so-called vanishing/exploding gradient problem, and the unitary RNN is a simple approach to mitigating it: a unitary constraint is imposed on the recurrent weight matrix. We show theoretically that, for RNNs with ReLU activations, imposing the unitary constraint entails no loss in the expressiveness of the model.
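To fix notation (again illustrative rather than taken from the dissertation), a vanilla RNN updates a hidden state $\mathbf{h}_t$ via

$$
\mathbf{h}_t = \sigma\big(W \mathbf{h}_{t-1} + U \mathbf{x}_t\big),
$$

and the unitary constraint requires $W^{\mathsf{H}} W = I$, so that $\|W\mathbf{h}\| = \|\mathbf{h}\|$ and gradients backpropagated through repeated applications of $W$ are neither amplified nor attenuated. The result above states that with $\sigma = \mathrm{ReLU}$ this constraint does not reduce the set of functions the model can represent.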
Finally, we explore the learning dynamics of RNNs trained by gradient descent using the recently developed kernel-regime analysis. Our results show that linear RNNs learned from random initialization are functionally equivalent to a certain weighted 1D-convolutional network. Importantly, the weightings in the equivalent model induce an implicit bias toward elements with smaller time lags in the convolution, and hence toward shorter memory. Interestingly, the degree of this bias depends on the variance of the transition matrix at initialization.
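The convolutional equivalence can be seen by unrolling a linear RNN (identity activation, zero initial state; notation as in the sketch above, with a readout vector $\mathbf{c}$ assumed for illustration):

$$
\mathbf{h}_t = W \mathbf{h}_{t-1} + U \mathbf{x}_t,
\quad
y_t = \mathbf{c}^{\top} \mathbf{h}_t
\;\;\Longrightarrow\;\;
y_t = \sum_{k=0}^{t-1} \mathbf{c}^{\top} W^{k} U \, \mathbf{x}_{t-k},
$$

i.e., the output is a 1D convolution of the input sequence with impulse-response coefficients $\mathbf{c}^{\top} W^{k} U$. A lag of $k$ steps enters through the $k$-th power of $W$, which is the channel through which the initialization variance of the transition matrix shapes the bias toward shorter memory.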