A Picture of the Energy Landscape of Deep Neural Networks
- Author(s): Chaudhari, Pratik Anil
- Advisor(s): Soatto, Stefano
- et al.
This thesis characterizes the training process of deep neural networks. We are driven by two apparent paradoxes. First, optimizing a non-convex function such as the loss function of a deep network should be extremely hard, yet rudimentary algorithms like stochastic gradient descent are phenomenally successful at this. Second, over-parametrized models are expected to perform poorly on new data, yet large deep networks with millions of parameters achieve spectacular generalization performance.
We build upon tools from two main areas to make progress on these questions: statistical physics and a continuous-time point of view of optimization. The former has been popular in the study of machine learning in the past and has been rejuvenated in recent years because the empirical properties of modern deep networks correlate strongly with existing, older analytical results. The latter, i.e., modeling stochastic first-order algorithms as continuous-time stochastic processes, gives access to powerful tools from the theory of partial differential equations, optimal transportation and non-equilibrium thermodynamics.
The confluence of these ideas leads both to fundamental theoretical insights that explain observed phenomena in deep learning and to the development of state-of-the-art algorithms for training deep networks.