In recent decades, deep learning has achieved tremendous success in supervised learning; however, unsupervised learning and representation learning, i.e., learning the hidden structure of data without expensive and time-consuming human annotation, remain a fundamental challenge, one that arguably underlies the gap between current artificial intelligence and the intelligence of the biological brain. In this thesis, we propose novel solutions to problems in this area. Specifically, we work on deep generative modeling, an important approach to unsupervised learning, and on representation learning inspired by structures in the brain.
1. We propose efficient algorithms for learning descriptive models, also known as energy-based models (EBMs). Although descriptive models are an appealing class of generative models with a number of desirable properties, learning them in high-dimensional spaces remains challenging because it involves computationally expensive Markov chain Monte Carlo (MCMC). To tackle this problem, we propose a multi-grid modeling and sampling method, which learns descriptive models at multiple scales or resolutions, with MCMC sampling following a coarse-to-fine scheme. This approach enables efficient learning and sampling of descriptive models from large-scale image datasets with a small MCMC budget. We then extend this method to an improved version named diffusion recovery likelihood, where a sequence of descriptive models is learned on increasingly noisy versions of a dataset. Each descriptive model is trained by sampling from the conditional distribution of the data at a given noise level, conditioned on its noisier version at the next higher noise level, which further eases the burden of MCMC.
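To make the recovery-likelihood step concrete, the following is a minimal PyTorch sketch of training a descriptive model at a single noise level. The network f, the noise level sigma, the simple additive noising, and the Langevin hyperparameters are illustrative assumptions, not the thesis's exact settings.

```python
import torch

def recovery_langevin(f, x_tilde, sigma, n_steps=30, step_size=0.01):
    """Short-run Langevin sampling from the recovery likelihood
    p(x | x_tilde) ∝ exp(f(x) - ||x_tilde - x||^2 / (2 sigma^2)),
    with the chain initialized at the noisy observation x_tilde."""
    x = x_tilde.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        log_p = f(x).sum() - ((x_tilde - x) ** 2).sum() / (2 * sigma ** 2)
        grad, = torch.autograd.grad(log_p, x)
        x = x + 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()

def training_step(f, optimizer, x_data, sigma):
    x_tilde = x_data + sigma * torch.randn_like(x_data)  # noisier version
    x_sample = recovery_langevin(f, x_tilde, sigma)      # conditional MCMC
    # Maximum-likelihood gradient: push up f on data, push down on samples.
    loss = f(x_sample).mean() - f(x_data).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the conditional distribution is much less multi-modal than the marginal, the chain can start at x_tilde and use far fewer steps than sampling the data distribution directly.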
2. We develop dynamic and motion-based generator models that learn semantically meaningful vector representations for spatial-temporal processes such as dynamic textures and action sequences in video data. The models are capable of learning disentangled representations of appearance, trackable motion, and intrackable motion in a fully unsupervised manner. We also propose an efficient learning algorithm, alternating back-propagation through time, which learns the proposed models by online MCMC inference of the latent variables, without resorting to auxiliary networks.
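As a rough illustration of the alternating back-propagation idea (shown here in its static form; the through-time version unrolls the same two steps over the latent sequence of a video), the sketch below alternates Langevin inference of the latent vector with a gradient step on the generator. The model x = g(z) + noise and all hyperparameters are assumptions for illustration.

```python
import torch

def infer_z(g, x, z_init, sigma=0.3, n_steps=20, step_size=0.05):
    """Langevin inference from the posterior
    p(z | x) ∝ exp(-||x - g(z)||^2 / (2 sigma^2) - ||z||^2 / 2)."""
    z = z_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        log_p = (-((x - g(z)) ** 2).sum() / (2 * sigma ** 2)
                 - 0.5 * (z ** 2).sum())
        grad, = torch.autograd.grad(log_p, z)
        z = z + 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(z)
        z = z.detach().requires_grad_(True)
    return z.detach()

def train_step(g, optimizer, x, z_warm):
    z = infer_z(g, x, z_warm)            # inferential back-propagation
    loss = ((x - g(z)) ** 2).mean()      # learning back-propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return z                             # persistent chain warm-starts the next step
```

Warm-starting each inference chain from the previous latent estimate is what makes the MCMC inference "online" and keeps the chains short, with no encoder network needed.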
3. We propose hybrid generative models that integrate the advantages of different classes of generative models. Specifically, we propose a training algorithm, flow contrastive estimation (FCE), to jointly estimate a descriptive model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. The algorithm is an extension of noise contrastive estimation (NCE) and combines the flexibility of descriptive models with the tractability of flow-based models. We also study another hybrid model in which the descriptive model serves as a correction, or exponential tilting, of the flow-based model. We show that this model has a particularly simple form in the space of the latent variables of the flow-based model, and that MCMC sampling of the descriptive model in the latent space mixes well and traverses modes in the data space.
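The shared value function can be written as a logistic classification objective whose logit is the log-density ratio between the two models. Below is a minimal sketch, assuming the descriptive model f returns an unnormalized log-density (absorbing a learned log normalizer) and the flow q exposes log_prob and sample methods, as in common flow libraries; these names are placeholders.

```python
import torch
import torch.nn.functional as F

def fce_value(f, q, x_data):
    """Shared adversarial value of flow contrastive estimation: the
    descriptive model ascends it, while the flow model descends it."""
    x_flow = q.sample(x_data.shape[0])           # "noise" drawn from the flow
    # Posterior logit that a point came from the EBM rather than the flow.
    logit_data = f(x_data) - q.log_prob(x_data)
    logit_flow = f(x_flow) - q.log_prob(x_flow)
    return (F.logsigmoid(logit_data).mean()      # classify data correctly
            + F.logsigmoid(-logit_flow).mean())  # classify flow samples correctly
```

Updating the flow to descend this value keeps the "noise" distribution close to the data, which addresses the main weakness of plain NCE, where a fixed noise distribution makes the classification task too easy to be informative.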
4. We propose an optimization-based representational model of grid cells. Grid cells exist in the mammalian medial entorhinal cortex (mEC) and are so named because individual neurons exhibit striking firing patterns that form hexagonal grids as the agent (such as a rat) navigates in a 2D open field. To understand how grid cells perform path integration, we conduct a theoretical analysis of a general representational model in which the 2D self-position of the agent is represented by a higher-dimensional vector and the 2D self-motion is represented by a general transformation of that vector. We identify two conditions on the general transformation and demonstrate an important geometric property, namely local conformal embedding. We further investigate the simplest transformation, the linear transformation, and uncover its explicit algebraic and geometric structure as a matrix Lie group of rotations. The learned model exhibits clear hexagonal grid patterns and is capable of accurate path integration.
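The rotation-group structure of the linear transformation can be illustrated numerically: with skew-symmetric generators, the matrix exponential yields an orthogonal transformation, so the position embedding moves on a sphere as the agent moves. The generators below are random placeholders for the sake of the sketch; in the thesis they are learned by optimization, and the hexagonal firing patterns emerge from the learned solution.

```python
import numpy as np
from scipy.linalg import expm

d = 24                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((2, d, d))
B1, B2 = A1 - A1.T, A2 - A2.T            # skew-symmetric Lie-algebra generators

def transform(v, dx):
    """Path integration: v(x + dx) = M(dx) v(x), where M(dx) is a rotation."""
    return expm(dx[0] * B1 + dx[1] * B2) @ v

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                   # v(x) lives on the unit sphere
for dx in [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, -0.1])]:
    v = transform(v, dx)
print(np.linalg.norm(v))                 # stays ≈ 1: rotations preserve norm
```

Norm preservation holds for any skew-symmetric generators; the additional conditions identified in the thesis constrain the learned generators so that displacements compose consistently along arbitrary paths.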
5. We extend the representational model of grid cells to an optimization-based representational model of V1 simple cells. V1 is the primary visual cortex of the mammalian brain, and V1 simple cells are highly specialized for low-level motion perception and pattern recognition. We propose a representational model of V1 simple cells that couples two components: (1) vector representations of the local contents of images, and (2) matrix representations of local pixel displacements caused by relative motion between the agent and the objects in the 3D scene. The model learns Gabor-like tunings resembling those of V1 simple cells, and, as in V1, adjacent learned neurons exhibit quadrature-phase relations.
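To sketch the coupling, suppose each local image patch is encoded as a vector v = W @ patch, and each quantized local displacement k acts on the code by a matrix M[k]; learning then minimizes the prediction error between the transformed code of a patch and the code of its displaced counterpart across frame pairs. The shapes, the displacement quantization, and the names W and M are illustrative assumptions, not the thesis's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_dim, d, n_disp = 16 * 16, 64, 9    # flattened patch, code size, displacement bins

W = rng.standard_normal((d, patch_dim)) * 0.01        # encoding filters (Gabor-like once learned)
M = np.stack([np.eye(d) + 0.01 * rng.standard_normal((d, d))
              for _ in range(n_disp)])                # one transformation per displacement

def prediction_loss(patch_t, patch_t1, k):
    """Motion consistency: the displaced code M[k] @ v_t should match v_{t+1},
    where k indexes the local displacement relating the two patches."""
    v_t, v_t1 = W @ patch_t, W @ patch_t1
    return np.sum((M[k] @ v_t - v_t1) ** 2)

# Example usage on random patches (stand-ins for patches from consecutive frames).
loss = prediction_loss(rng.standard_normal(patch_dim),
                       rng.standard_normal(patch_dim), k=3)
```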