Learning Task-sufficient Representation of Video Dynamics
- Author(s): Bei, Xinzhu
- Advisor(s): Soatto, Stefano
- et al.
This dissertation provides a generic solution for modeling dynamic systems whose hidden state and transition model are unknown in practice. We build a task-sufficient filtering framework that maintains a finite, abstract, and learnable representation (memory), sufficient both to update itself, causally and iteratively, and to predict downstream task variables of interest. We realize the framework with recurrent neural networks, a universally approximating function class, which imitate the functionality of a state transition model and a task prediction model.
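The causal, iterative structure described above can be sketched minimally as a recurrent update of a memory vector plus a task readout. The dimensions, weights, and per-frame features below are hypothetical stand-ins for learned quantities; this is an illustrative sketch, not the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; all weights are hypothetical stand-ins for
# parameters that would be learned from data.
state_dim, obs_dim, task_dim = 8, 4, 2
W_h = rng.normal(scale=0.1, size=(state_dim, state_dim))  # memory-to-memory
W_x = rng.normal(scale=0.1, size=(state_dim, obs_dim))    # observation-to-memory
W_y = rng.normal(scale=0.1, size=(task_dim, state_dim))   # memory-to-task readout

def update_memory(h, x):
    """Causal, iterative update: h_t depends only on h_{t-1} and x_t."""
    return np.tanh(W_h @ h + W_x @ x)

def predict_task(h):
    """Task readout: the memory alone is used to predict the task variable."""
    return W_y @ h

h = np.zeros(state_dim)                 # initial memory
video = rng.normal(size=(10, obs_dim))  # stand-in for per-frame features
for x in video:
    h = update_memory(h, x)             # update memory frame by frame
y = predict_task(h)                     # predict from the memory only
```

The key property being illustrated is task sufficiency: after the loop, the finite memory `h` is the only quantity passed to the predictor, so it must summarize everything in the past frames relevant to the task.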
In addition, we provide practical methodologies for imposing generic priors of the physical scene on the hidden representation. We leverage (lower-level) topological and regularity constraints of natural images, such as occlusion relations, to define object regions. From these regions, we capture the motion priors associated with different (higher-level) semantic categories, which are combined to describe the dynamics of the whole scene.
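The composition step above can be sketched as a mask-weighted combination of per-category motion fields: region masks assign pixels to categories, and each category contributes its own motion prior. The masks, categories, and flow values below are hypothetical, chosen only to make the combination concrete.

```python
import numpy as np

H, W = 4, 6  # tiny frame for illustration

# Hypothetical per-pixel object-region masks (e.g. derived from occlusion
# and regularity cues); a soft assignment summing to one over categories.
masks = np.zeros((2, H, W))
masks[0, :, :3] = 1.0   # illustrative "object" region
masks[1, :, 3:] = 1.0   # illustrative "background" region

# Hypothetical per-category motion priors: one 2D flow vector per category.
category_flow = np.array([[1.0, 0.0],   # object moves right
                          [0.0, 0.0]])  # background is static

# Scene dynamics: combine category-level motion, weighted by region masks.
scene_flow = np.einsum('khw,kc->hwc', masks, category_flow)
```

Here the per-pixel field `scene_flow` describes the whole scene, while each category's motion is modeled only once, which is the sense in which higher-level semantic priors are shared across the regions they explain.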
The framework takes videos as sequential input streams and produces representations of video dynamics. We demonstrate its success by applying it to real-world computer vision tasks, including generic object tracking and video prediction. The learned dynamic models extend to a range of settings that require a dynamically and causally updated memory with uncertainty.