Generative Modeling and Unsupervised Learning in Computer Vision
Developing statistical models and associated learning algorithms for the rich visual patterns in natural images is of fundamental importance for computer vision. More importantly, the endeavor has the potential to enrich our treasured collections of statistical models and expand the already vast reach of machine learning methodologies. Generative models enable us to learn useful features and representations from the natural images in an unsupervised manner. The learned features and representations can be more interpretable and explicit than those learned by the discriminative models, especially if the learned models are sparse. The objective of this dissertation is to learn probabilistic generative models for representing visual patterns in natural images.
In this dissertation, we first develop a sparse FRAME model as a generative model for representing natural image patterns. The model is an inhomogeneous and sparsified version of the original FRAME (Filters, Random field, And Maximum Entropy) model. More specifically, it is a probability distribution defined on the image space which combines wavelet sparse coding and Markov random field modeling. We propose two different algorithms to learn this model. The first is a two-stage algorithm that initially selects the wavelets by a shared sparse coding algorithm and then estimates the weight parameters by maximum likelihood via stochastic gradient ascent. The second approach utilizes a single-stage algorithm that uses a generative boosting method combined with a Gibbs sampler on the reconstruction coefficients of the selected wavelets. Our experimental results show that the proposed sparse FRAME model can not only learn to generate realistic images of a wide variety of image patterns, but can also be used for object detection, clustering, codebook learning, bag-of-word image classification, and domain adaptation.
We further propose a hierarchical version of FRAME models that we call generative ConvNet. The probability distribution of the generative ConvNet model is in the form of exponential tilting of a reference distribution, and the exponential tilting is defined by ConvNet that involves multiple layers of liner filtering and non-linear transformation. Assuming re-lu non-linearity and Gaussian white noise reference distribution, we show that the generative ConvNet model contains a representational structure with multiple layers of binary activation variables. The model is piecewise Gaussian, where each piece is determined by an instantiation of the binary activation variables that reconstruct the mean of the Gaussian piece. The Langevin dynamics for synthesis is driven by the reconstruction error, and the corresponding gradient descent dynamics converges to a local energy minimum that is auto-encoding. As for learning, we show that the contrastive divergence learning tends to reconstruct the observed images. We also generalize the spatial generative ConvNet to model dynamic textures by adding the temporal dimension. The spatial-temporal generative ConvNet consists of multiple layers of spatial-temporal filters to capture the spatial-temporal patterns in the dynamic textures. Finally, we show that the maximum likelihood learning algorithm can generate not only vivid natural images but also realistic dynamic textures.
We lastly investigate a connection of the proposed models to auto-encoders. We show that the local modes of both the sparse FRAME model and the generative ConvNet are represented by auto-encoders, with explicit encoding of the data in terms of filtering operations, and explicit decoding that generates the data in terms of the basis functions that corresponds to the filters. We call these auto-encoders the Hopfield auto-encoders because they describe the local energy minima of the models. We develop learning algorithms to learn those models by fitting Hopfield auto-encoders. We show that it is possible to select wavelets and estimate weight parameters for sparse FRAME models by fitting Hopfield auto-encoders. Moreover, meaningful dictionaries of filters can be obtained by learning hierarchical Hopfield auto-encoders for generative ConvNet. Without MCMC, the Hopfield auto-encoder has the potential to tremendously accelerate learning that is crucial for big data.