Training multi-layer or recursive models, such as neural networks or dynamical system models, entails solving a nonconvex optimization problem. These nonconvex problems are usually solved with iterative optimization algorithms, such as the gradient descent algorithm or one of its variants. Once an iterative algorithm is involved, its dynamics become critical in determining the specific solution obtained for the optimization problem. In this dissertation, we use tools from nonlinear and adaptive control theory to analyze and understand how the dynamics of the training procedures affect the solutions obtained, and we synthesize new methods to facilitate optimization, to provide robustness for the trained models, and to help explain observed outcomes more accurately.
By studying Lyapunov stability of the fixed points of the gradient descent algorithm, we show that this algorithm can only yield a solution from a bounded class of functions when training multi-layer models. We establish a relationship between the learning rate of the algorithm and the Lipschitz constant of the function estimated by the multi-layer model. We also show that keeping every layer of the model close to the identity operation boosts the stability of the optimization algorithm and allows the use of larger learning rates.
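The link between learning rate and Lipschitz constant can be illustrated with a minimal sketch. It uses a scalar two-layer linear model, which is an assumption chosen for illustration and not the setting analyzed in the dissertation; all function names are hypothetical. In this toy case, a minimizer is a stable fixed point of gradient descent only if the learning rate times the largest Hessian eigenvalue is below 2, which bounds the slope (the Lipschitz constant) of any function the algorithm can converge to by the reciprocal of the learning rate.

```python
# Toy illustration (an assumption, not the dissertation's model):
# a scalar two-layer linear model f(x) = w2 * w1 * x fitted to y = a * x
# with loss L(w1, w2) = 0.5 * (w1 * w2 - a) ** 2.

def top_curvature(w1, w2):
    """Largest Hessian eigenvalue of L at a minimizer (where w1 * w2 == a)."""
    # At a minimizer the Hessian is [[w2^2, w1*w2], [w1*w2, w1^2]],
    # whose eigenvalues are 0 and w1^2 + w2^2.
    return w1 ** 2 + w2 ** 2

def is_stable_fixed_point(lr, w1, w2):
    """Linearized stability test for gradient descent: lr * curvature < 2."""
    return lr * top_curvature(w1, w2) < 2.0

def largest_learnable_slope(lr):
    """Since w1^2 + w2^2 >= 2 * |w1 * w2|, stability forces |a| < 1 / lr."""
    return 1.0 / lr

def gradient_descent(a, lr, w1, w2, steps=500):
    """Plain gradient descent on L; returns the final weights."""
    for _ in range(steps):
        err = w1 * w2 - a
        # Simultaneous update of both layers.
        w1, w2 = w1 - lr * err * w2, w2 - lr * err * w1
    return w1, w2

if __name__ == "__main__":
    # Target slope a = 4: reachable with lr = 0.1 (bound 10), not with lr = 0.3.
    print(is_stable_fixed_point(0.1, 2.0, 2.0))
    print(is_stable_fixed_point(0.3, 2.0, 2.0))
    print(gradient_descent(4.0, 0.1, 1.0, 1.5))
```

In this sketch the balanced minimizer w1 = w2 has the smallest curvature among all minimizers, so it tolerates the largest learning rate.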
We use a classical concept in system identification and adaptive control, namely persistence of excitation, to study the robustness of multi-layer models. We show that when a multi-layer model is trained with the gradient descent algorithm, robust estimation of its unknown parameters requires richness not only of the training data but also of the hidden-layer activations throughout the training procedure. We derive necessary and sufficient richness conditions for the signals in each layer of the model, and we show that these conditions are usually not satisfied by models that have been naively trained with the gradient descent algorithm, since the signals in their hidden layers become low-dimensional during training. By revisiting the common regularization methods for single-layer models, reinterpreting them in terms of enhancing the richness of the training data, and drawing an analogy for multi-layer models, we design a training mechanism that provides the required richness for the signals in the hidden layers of multi-layer models. For a neural network trained for a classification task, this training procedure leads to similar margin distributions for the training and test data, indicating its effectiveness as a regularization method.
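For reference, the standard discrete-time form of the persistence of excitation condition from the adaptive control literature is sketched below. This is the textbook definition, not necessarily the exact richness condition derived in the dissertation. Reading the regressor $\phi_k$ as the signal entering a given layer at training step $k$ is an assumption made here for illustration, as are the symbols $\alpha$, $N$, and $n$.

```latex
% Standard discrete-time persistence-of-excitation condition (textbook form).
% \phi_k \in \mathbb{R}^n is the regressor at step k; \alpha and N are constants.
\exists\, \alpha > 0,\ N \in \mathbb{N} \ \text{such that} \quad
\sum_{k=t}^{t+N-1} \phi_k \phi_k^{\top} \succeq \alpha I_n
\quad \text{for all } t \ge 0 .
```

Under this reading, if the signals $\phi_k$ are confined to a proper subspace of $\mathbb{R}^n$, the summed matrix is rank deficient and no such $\alpha > 0$ exists, which is the sense in which low-dimensional hidden-layer signals fail to be rich.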
We also study the dynamics of the gradient descent algorithm on dynamical systems. We show that when learning the unknown parameters of an unstable dynamical system, the observations taken from the system at different times influence the dynamics of the gradient descent algorithm to substantially different degrees. In particular, the observations taken near the end of the time horizon impose an exponentially strict constraint on the learning rate that can be used for the gradient descent algorithm, yet such small learning rates cannot recover the stable modes of the system. We show that warping the observations of the system in a particular way and creating risk-sensitivity in the observations remedies this imbalance and allows learning both the stable and the unstable modes of a linear dynamical system.
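The imbalance between early and late observations can be seen in a minimal sketch. It assumes a scalar linear system and a plain squared-error loss, neither of which is taken from the dissertation, and all names are hypothetical. It computes each observation's contribution to the curvature of the loss at the true parameter, along with the resulting linearized bound on the learning rate.

```python
# Toy illustration (an assumption, not the dissertation's setup):
# scalar system x[t+1] = a * x[t], observed as x[t] = a**t * x0 for t = 1..T,
# with a estimated by gradient descent on
#   L(a_hat) = 0.5 * sum_t (a_hat**t * x0 - x[t]) ** 2.

def curvature_terms(a, x0, horizon):
    """Per-observation contributions to L''(a_hat) at a_hat = a."""
    # The residuals vanish at a_hat = a, so each term is the squared
    # sensitivity of the prediction: (t * a**(t - 1) * x0) ** 2.
    return [(t * a ** (t - 1) * x0) ** 2 for t in range(1, horizon + 1)]

def learning_rate_bound(a, x0, horizon):
    """Linearized stability of gradient descent needs lr < 2 / L''(a)."""
    return 2.0 / sum(curvature_terms(a, x0, horizon))

if __name__ == "__main__":
    # Unstable mode (a = 2): the last observations dominate the curvature.
    print(curvature_terms(2.0, 1.0, 3))
    print(learning_rate_bound(2.0, 1.0, 20))
    # Stable mode (a = 0.5): the admissible learning rate stays of order one.
    print(learning_rate_bound(0.5, 1.0, 20))
```

For a = 2 and a horizon of 3, the three contributions are 1, 16, and 144, so the final observation already dominates. With a longer horizon the admissible learning rate for the unstable mode shrinks exponentially. For the stable mode the bound stays of order one, so a learning rate small enough for the unstable mode makes only very slow progress on the stable one.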
The results in this dissertation establish a strong connection between machine learning problems involving nonconvex optimization and the classical tools of nonlinear and adaptive control theory. While analyses based on Lyapunov stability and persistence of excitation help us understand and enhance machine learning models trained with iterative optimization algorithms, the major effect that altering the training dynamics has on multi-layer machine learning models indicates the potential for improving system identification for dynamical systems by designing alternative loss functions.