On Priors for Bayesian Neural Networks
- Author(s): Nalisnick, Eric Thomas
- Advisor(s): Smyth, Padhraic
Deep neural networks have bested notable benchmarks across computer vision, reinforcement learning, speech recognition, and natural language processing. However, neural networks still have deficiencies. For instance, they have a penchant for over-fitting, and large data sets and careful regularization are needed to combat this tendency. Using neural networks within the Bayesian framework has the potential to ameliorate or even solve these problems. Shrinkage-inducing priors can be used to regularize the network, for example. Moreover, test-set evaluation is performed by integrating out parameter uncertainty, i.e., by using the posterior predictive distribution. Marginalizing the model parameters in this way is not only a natural regularization mechanism but also enables uncertainty quantification---which is increasingly important as machine learning is deployed in ever more consequential applications.
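The posterior predictive evaluation described above can be sketched with a Monte Carlo average over posterior weight samples. This is a minimal illustration, not any specific method from the dissertation: the tiny one-hidden-layer network and the stand-in "posterior" samples below are hypothetical, and in practice the samples would come from MCMC or variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-hidden-layer network; weights flattened into one vector.
def predict(x, w):
    W1 = w[:10].reshape(1, 10)    # input -> 10 hidden units
    b1 = w[10:20]
    W2 = w[20:30].reshape(10, 1)  # hidden units -> scalar output
    h = np.tanh(x @ W1 + b1)
    return h @ W2

# Stand-in for posterior samples over the 30 weights (in practice these
# would be drawn by MCMC or variational inference).
posterior_samples = rng.normal(size=(200, 30))

x = np.array([[0.5]])
preds = np.stack([predict(x, w) for w in posterior_samples])

# Averaging predictions marginalizes the weights; the spread across
# samples provides the uncertainty quantification mentioned above.
predictive_mean = preds.mean(axis=0)
predictive_std = preds.std(axis=0)
```

The same set of samples yields both the regularized point prediction (the mean) and an uncertainty estimate (the spread), which is the dual benefit the abstract refers to.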
Bayesian inference is characterized by the specification of a prior distribution, and unfortunately, choosing priors for neural networks is difficult. The primary obstacle is that the weights have no intuitive interpretation, and seemingly sensible priors can induce unintended artifacts in the distribution over output functions. This dissertation aims to help the reader navigate the landscape of neural network priors. I first survey the existing work on priors for neural networks, isolating key themes such as the move towards heavy-tailed priors. I then describe my own work on broadening the class of priors applicable to Bayesian neural networks. I show that introducing multiplicative noise into the hidden-layer computation induces a Gaussian scale mixture prior, suggesting links between dropout regularization and previous work on heavy-tailed priors. I then turn to priors with frequentist properties. Reference priors cannot be derived analytically for neural networks, so I propose an algorithm to approximate them. Similarly, it is hard to derive priors that make the model invariant under certain input transformations. To make progress, I use an algorithm inspired by my work on objective priors to learn a prior that makes the model approximately invariant. Lastly, I describe how to give Bayesian neural networks an adaptive width by placing stick-breaking priors on their latent representations. I end the dissertation with a discussion of open problems, such as incorporating structure into priors while still maintaining efficient inference.
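The Gaussian scale mixture connection can be seen in a few lines: multiplying a Gaussian variable by an independent random scale yields a marginal that is heavier-tailed than any single Gaussian. The sketch below is illustrative only; the Gamma scale distribution is one arbitrary choice (an inverse-Gamma variance would give a Student-t marginal) and is not the specific noise model studied in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Base Gaussian draws, standing in for network weights.
eps = rng.normal(size=n)

# Multiplicative noise acts as a random scale on the Gaussian;
# marginally, scales * eps is a Gaussian scale mixture.
scales = rng.gamma(shape=2.0, scale=1.0, size=n)
w = scales * eps

def excess_kurtosis(x):
    # Excess kurtosis is 0 for a Gaussian, positive for heavier tails.
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2 - 3.0

k_gauss = excess_kurtosis(eps)  # approximately 0
k_mix = excess_kurtosis(w)      # clearly positive: heavy tails
```

The positive excess kurtosis of the mixture is the sense in which multiplicative noise connects dropout-style regularization to heavy-tailed priors.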
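The adaptive-width idea can be illustrated with a truncated stick-breaking construction: each hidden unit keeps a Beta-distributed fraction of the remaining "stick," so later units receive geometrically less mass and the effective width adapts to the data. This is a generic stick-breaking sketch, not the dissertation's exact construction; the truncation level, concentration parameter, and threshold below are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = 2.0      # concentration: larger alpha spreads mass over more units
max_width = 50   # truncation level for the sketch

# Stick-breaking: v_k ~ Beta(1, alpha); pi_k = v_k * prod_{j<k}(1 - v_j).
v = rng.beta(1.0, alpha, size=max_width)
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
pi = v * remaining  # per-unit activation probabilities

# Effective width: units whose probability exceeds a small threshold.
effective_width = int((pi > 1e-3).sum())
```

Because the truncated weights sum to less than one and decay with depth into the stick, only a data-dependent prefix of the units carries appreciable mass, which is what gives the network an adaptive width.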