Learning Transformations From Video
- Author(s): Wang, Ching Ming
- Advisor(s): Olshausen, Bruno A
- et al.
Our survival depends on an accurate understanding of the environment around us through sensory inputs. One way to achieve this is to build models of the surrounding environment that can explain the data. Statistical models such as PCA, ICA and sparse coding attempt to do so by exploiting the second- and higher-order structure of sensory data. While these models have been shown to reveal key properties of the mammalian sensory system and have been successfully applied in various engineering applications, they share one weakness: they assume that each observation is independent. In reality, there is often a transformational relationship between sensory data observations. Exploiting this relationship allows us to tease apart the causes of the data and reason about the environment. In this thesis, I develop an unsupervised learning framework that attempts to find the transformational relationship between data and infer the causes of the observed data.
This dissertation is divided into three chapters. First, I propose an unsupervised learning framework that models the transformations between data points using a continuous transformation model. I highlight the difficulties faced by previous attempts using similar models. I overcome these hurdles by proposing a learning rule that computes the learning updates for an exponential model in polynomial time. I also propose an adaptive inference algorithm that avoids local minima. These improvements make learning transformations possible and efficient.
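To make the idea of a continuous, exponential transformation model concrete, here is a minimal sketch. It assumes the common form in which a data point `x0` maps to `x1 = expm(s * A) @ x0`, where `A` is a generator matrix and `s` is a scalar transformation parameter. The abstract does not specify the exact parameterization, so the function names (`expm_taylor`, `transform`, `infer_s`), the hand-picked rotation generator, and the grid search over `s` are all illustrative assumptions. In particular, the grid search is a simple stand-in and not the adaptive inference algorithm proposed in the thesis.

```python
import numpy as np

def expm_taylor(M, terms=20, squarings=8):
    """Approximate the matrix exponential by scaling and squaring
    combined with a truncated Taylor series (illustrative helper)."""
    M = np.asarray(M, dtype=float) / (2 ** squarings)
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ M / k          # M^k / k!
        result = result + term
    for _ in range(squarings):
        result = result @ result     # undo the initial scaling
    return result

def transform(x, A, s):
    """Apply the assumed continuous transformation x -> expm(s * A) x."""
    return expm_taylor(s * A) @ x

def infer_s(x0, x1, A, candidates):
    """Pick the candidate s with the smallest squared reconstruction error.
    Hypothetical grid search, not the thesis's adaptive inference."""
    errors = [np.sum((x1 - transform(x0, A, s)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errors))]

# Hand-picked generator of 2-D rotations (assumed here, not learned).
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
x0 = np.array([1.0, 0.0])
x1 = transform(x0, A, 0.5)           # rotate x0 by 0.5 radians

candidates = np.linspace(-1.0, 1.0, 201)
s_hat = infer_s(x0, x1, A, candidates)
```

In this toy setting the generator is known and only `s` is inferred; in the framework described above, the generators themselves are learned from data.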
Second, I perform a detailed analysis of the proposed model. I show that the adaptive inference algorithm simultaneously recovers multiple transformation parameters with high accuracy when given synthetic data in which the transformation is known. When trained on pairs of images related by affine transformations, the algorithm correctly recovers the transformation operators. When trained on pairs of natural movie frames, the unsupervised learning algorithm discovers transformations such as translation, illumination adjustment, contrast enhancement and local deformations. I also show that the learned models describe the underlying transformations better, both qualitatively and quantitatively, than commonly used motion models.
Third, I describe a plausible application of the continuous transformation model in video coding. In a hybrid coding scheme, I propose to replace the traditionally used exhaustive-search motion model with transformation models learned on natural time-varying images. I document a detailed analysis of the rate-distortion characteristics of the different learned models and show that the learned models improve on the performance of traditional motion models in various settings.
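For reference, the exhaustive-search baseline mentioned above can be sketched as block matching: for a block in the current frame, every displacement within a search window in the previous frame is tried, and the one with the lowest cost is kept. The function name `block_match`, the block size, the search radius, and the sum-of-absolute-differences cost are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def block_match(prev, curr, top, left, size=4, radius=2):
    """Exhaustively search `prev` for the best match to one block of `curr`.
    Returns the (dy, dx) displacement and its sum of absolute differences."""
    block = curr[top:top + size, left:left + size]
    best, best_cost = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidate blocks that fall outside the previous frame.
            if y < 0 or x < 0 or y + size > prev.shape[0] or x + size > prev.shape[1]:
                continue
            cost = np.abs(prev[y:y + size, x:x + size] - block).sum()
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost

# Synthetic pair of frames: the current frame is the previous frame
# shifted down by 1 pixel and right by 2 pixels (with wrap-around).
rng = np.random.default_rng(0)
prev = rng.random((16, 16))
curr = np.roll(prev, shift=(1, 2), axis=(0, 1))

motion, cost = block_match(prev, curr, top=6, left=6)
```

A search of this kind can only express translations of a block, which is one reason a learned transformation model is an attractive replacement.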