In this dissertation, we seek a simple and unified probabilistic model, empowered by modern neural networks and computing hardware, that is versatile enough to model patterns of high dimensionality and complexity in various domains such as natural images and natural language. We achieve this goal by studying three families of probabilistic models and proposing a unification of them, which leads to a simple yet versatile model with rich applications across domains.
In the modern deep learning era, three families of probabilistic models are widely used to model complex patterns. The first family is the generator model, which assumes that the observed example is generated from a low-dimensional latent vector via a top-down network, with the latent vector following a non-informative prior distribution. The second family is the energy-based model (EBM), which specifies a probability distribution of the observed example through an energy function defined on the observed example and parameterized by a bottom-up deep network. The third family is the discriminative model, which takes the form of a classifier and specifies the conditional probability of the output class label given an input signal.
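To fix ideas, the three families can be sketched as follows; the symbols $g_\beta$, $f_\alpha$, and $C_\phi$ below are illustrative placeholders rather than the exact notation fixed later in this dissertation.
\[
\text{Generator model:}\quad z \sim \mathcal{N}(0, I_d), \qquad x = g_\beta(z) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D);
\]
\[
\text{Energy-based model:}\quad p_\alpha(x) = \frac{1}{Z(\alpha)} \exp\big(f_\alpha(x)\big), \qquad Z(\alpha) = \int \exp\big(f_\alpha(x)\big)\, dx;
\]
\[
\text{Discriminative model:}\quad p_\phi(y \mid x) = \frac{\exp\big(C_\phi(x)[y]\big)}{\sum_{y'} \exp\big(C_\phi(x)[y']\big)}.
\]
Here $z \in \mathbb{R}^d$ is a latent vector with $d$ much smaller than the data dimension $D$, and $y$ is a class label.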
The EBM is expressive but poses challenges in sampling, since the energy function defined in the data space has to be highly multi-modal in order to fit the usually multi-modal data distribution; the generator model is relatively less expressive but convenient and efficient for sampling owing to its simple factorized form. We first integrate these two models. In particular, we propose to learn an EBM in the latent space as the prior distribution of the generator model, following the philosophy of empirical Bayes. We call the proposed model the latent space energy-based model, which consists of the energy-based prior model and the top-down generation model. Due to the low dimensionality of the latent space, a simple energy function in the latent space can capture regularities in the data effectively. Thus, the resulting model is much more expressive than the original generator model at little cost in model complexity and computational complexity. Moreover, MCMC sampling in the latent space is much more efficient and mixes better than sampling in the observed data space. Furthermore, we introduce a principled learning algorithm, formulated as a perturbation of maximum likelihood learning in terms of both the objective function and the estimating equation, so that the learning algorithm has a solid theoretical foundation.
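A sketch of this formulation is given below; the symbols $f_\alpha$, $p_\beta$, and $p_0$ are introduced here for illustration and stand for the latent-space energy function, the top-down generation model, and the non-informative base prior, respectively.
\[
p_\alpha(z) = \frac{1}{Z(\alpha)} \exp\big(f_\alpha(z)\big)\, p_0(z), \qquad
p_\theta(x, z) = p_\alpha(z)\, p_\beta(x \mid z), \qquad
p_\theta(x) = \int p_\alpha(z)\, p_\beta(x \mid z)\, dz,
\]
with $\theta = (\alpha, \beta)$. The maximum likelihood learning gradients then take the form
\[
\nabla_\alpha \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\big[\nabla_\alpha f_\alpha(z)\big] - \mathbb{E}_{p_\alpha(z)}\big[\nabla_\alpha f_\alpha(z)\big], \qquad
\nabla_\beta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\big[\nabla_\beta \log p_\beta(x \mid z)\big].
\]
In this sketch, both expectations are over distributions defined in the low-dimensional latent space and can be approximated by short-run MCMC; replacing exact posterior and prior samples with such approximate samples is what yields the perturbation of maximum likelihood learning mentioned above.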
We verify the proposed model and learning algorithm on a variety of image and text datasets, such as human faces and financial news, and the model effectively learns from these high-dimensional and complex datasets. As a result, we can draw faithful and diverse samples from the learned models. We also find that the well-learned model induces a discriminative latent space that separates the probability densities of normal and anomalous data, naturally making the model a tool for anomaly detection.
Having established the effectiveness of the proposed latent space EBM and its learning algorithm, we explore two applications that leverage two respective aspects of the latent space EBM. In the first application, we exploit the expressiveness of the latent space EBM and use it to model molecules encoded in a simple format of linear strings. Despite its convenience, models relying on this simple representation tend to generate invalid samples and duplicates. Owing to its expressiveness, a latent space EBM learned on molecules in this simple and convenient representation generates molecules whose validity, diversity, and uniqueness are competitive with state-of-the-art models, and the generated molecules have structural and chemical features whose distributions almost perfectly match those of the real molecules. In the second application, we explore the view of the EBM as a cost function and make a connection with inverse reinforcement learning for diverse human trajectory forecasting. The cost function is learned from expert demonstrations projected into the latent space. To make a forecast, we optimize the cost function to obtain a belief vector, which is then mapped to the trajectory space by a policy network; see the sketch below. The proposed model makes accurate, multi-modal, and socially compliant trajectory predictions.
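The forecasting step can be sketched schematically as follows; the symbols $E_\alpha$, $\pi_\beta$, and $c$ (denoting the learned latent-space cost, the policy network, and the observed context) are illustrative placeholders rather than the exact notation used later.
\[
z^\ast = \arg\min_{z} E_\alpha(z \mid c), \qquad \hat{x}_{1:T} = \pi_\beta(z^\ast, c),
\]
where the belief vector $z^\ast$ is obtained by optimizing the learned cost function in the latent space, and the policy network $\pi_\beta$ maps it to a predicted trajectory $\hat{x}_{1:T}$; drawing multiple low-cost beliefs yields multi-modal forecasts.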
Building on the unification of the generator model and the EBM, we further integrate the discriminative model into the latent space EBM via an energy term that couples a continuous latent vector and a symbolic one-hot vector. With such a coupling formulation, the discrete category can be inferred from the observed example based on the continuous latent vector. Moreover, the latent space coupling naturally enables the incorporation of information bottleneck regularization, which encourages the continuous latent vector to extract information from the observed example that is informative of the underlying category. In our learning method, the symbol-vector coupling, the generator network, and the inference network are learned jointly. The model can be learned in either an unsupervised setting or a semi-supervised setting where category labels are provided for a subset of the training examples. With the symbol-vector coupling, the learned latent space is well-structured, so that the generator generates text with high quality and interpretability, and the model performs well on classification tasks with a limited amount of labeled data.
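A sketch of the coupling is as follows; the symbols $y$, $z$, and $f_\alpha$ (the one-hot symbolic vector, the continuous latent vector, and the coupling energy network) are again illustrative.
\[
p_\alpha(y, z) = \frac{1}{Z(\alpha)} \exp\big(\langle y, f_\alpha(z) \rangle\big)\, p_0(z), \qquad
p_\alpha(y \mid z) = \frac{\exp\big(f_\alpha(z)[y]\big)}{\sum_{y'} \exp\big(f_\alpha(z)[y']\big)},
\]
so that, given the continuous latent vector $z$, the symbolic one-hot vector $y$ follows a softmax classifier over the coupling energies; this is how the discrete category is inferred from an observed example through its inferred latent vector.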