Nijkamp, Erik Lennart

Learning Descriptive and Generative Models with Short-Run MCMC

2021

Advisor(s): Zhu, Song-Chun

Abstract

What is vision? The mystery of how the visual cortex extracts abstract concepts from a plethora of visual sensory stimuli has captivated pioneers such as Hermann von Helmholtz and David Marr for the past century. \textit{Helmholtz} states that what we see is the solution to a computational problem; our brains compute the most likely causes for the photon absorptions within our eyes. In his monumental work ``Vision'', \textit{Marr} conceptualizes the process of vision as a set of representations, starting from a description of the input image and culminating with a description of three-dimensional objects in the surrounding environment. \textit{David Bryant Mumford} proposes hierarchical Bayesian inference as a means to understand the visual cortex. In the context of predictive coding theory, Mumford argues that the function of the hierarchical structure in the cortex is to reconcile representations and predictions of sensory stimuli at multiple levels. The assumption is that the dynamics of neural activity are guided towards minimizing the discrepancy or error between the input representation at each level and the prediction originating from a higher-level representation. \textit{Song-Chun Zhu} and \textit{Ying Nian Wu} propose a holistic realization of Marr's paradigm with rigorous statistical modeling in their work ``Computer Vision - Statistical Models for Marr's Paradigm''.

We follow in these footsteps towards a realization of Marr's paradigm and frame vision as the problem of Bayesian posterior inference in a latent-variable model. Following predictive coding theory, we believe that higher-level representations emerge from a reconstruction of the sensory stimuli as an inference process in a top-down model, for which inference may be amortized in a bottom-up model. Notably, the posterior inference takes the form of a Markov chain Monte Carlo (MCMC) sampling process which maintains a set of most probable candidate solutions. This approach naturally explains observations such as the resolution of ambiguity in poorly handwritten text, as well as phenomena such as hallucination. In this sense, ``seeing'' itself is merely an illusion. The dominance of top-down processing within the cortex is supported not only by observations such as these, but also by findings in neuroscience concerning the structure of the cortex.

While such generative models in the form of top-down latent-variable architectures are desirable in the context of vision as a Bayesian inference problem, the maximum-likelihood learning of the model parameters poses a severe computational challenge. As such, we propose \textit{short-run MCMC} as a computationally efficient means of sampling for synthesis in descriptive models and posterior inference in latent-variable models. The bottleneck for the likelihood-based learning of generative models is the intractable computation of expectations, which usually requires expensive MCMC sampling, whose convergence can be highly problematic. The main motivation of the dissertation is to get around this bottleneck. The manuscript follows the journey of (1) Energy Landscape Mapping of the macroscopic structure of the learned energy potential function of the DeepFRAME model, (2) Anatomy of MCMC-based MLE Learning towards understanding MCMC-based maximum likelihood learning of the bottom-up DeepFRAME model, (3) Short-run MCMC for Synthesis, which is the discovery of short-run MCMC for sampling from such distributions in a computationally efficient manner, (4) Short-run MCMC for Inference as a means of posterior sampling in top-down, hierarchical, latent-variable models, and (5) Latent Space Descriptive Models, which culminates with the lifting of the DeepFRAME model into the latent space of top-down models, for which the Markov chains enjoy mixing.

The research culminates in the latent space descriptive model, in which a descriptive model in latent space stands on the shoulders of a latent-variable model, and for which maximum likelihood learning involves sampling from both the prior and posterior distributions in the form of short-run MCMC. In the language of Zhu and Wu, the top-down model generates textons, and the descriptive model regulates the perceptual organization of textons, or describes the Gestalt law of textons.


UCLA
