UC San Diego
Visual Understanding of Complex Human Behavior via Attribute Dynamics
- Author(s): Li, Weixin
- Advisor(s): Vasconcelos, Nuno
- et al.
Visual understanding of human behavior in video sequences is one of the fundamental topics in computational vision. Because human activity is a sequential signal by nature, most critical insights into it can only be gained by modeling its temporal structure. Although this proposition is intuitive, the task is non-trivial to accomplish. One of the most significant obstacles is the enormous variability and the distinct properties of temporal structure at different levels of the human motion hierarchy, which spans a wide range of collectiveness, time and space, semantic granularity, and so forth. This poses a serious challenge for any solution, which must simultaneously capture instantaneous movements, encode mid-level evolution patterns, cope with long-term non-stationarity or content drift, and remain invariant to intra-class variation and other visual noise.
While most previous work in the literature addresses only some aspects of this problem, we aim to develop a unified framework that handles them all for complex human activity analysis. Specifically, we propose to model the temporal structure of human behavior on a robust, stable, yet general representation platform that encodes semantically meaningful concepts (or attributes). This platform bridges the gap between low-level visual features and high-level logical reasoning, bringing benefits such as better generalization, knowledge transfer, and so forth. While attributes abstract semantic information from short-term motion in the low-level visual signal, the dynamic model focuses on characterizing the mid-range evolution patterns in this attribute space. To cope with long-term non-stationarity and intra-class variation in complex events, we derive two encoding schemes that capture the zeroth- and first-order statistics of the attribute dynamics in video snippets, instead of precisely characterizing the whole sequence, which is prone to over-fitting due to the sparse nature of complex event instantiation.
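To make the distinction between zeroth- and first-order statistics concrete, the following is a minimal sketch in the spirit of bag-of-words and VLAD encodings. It is not the encoding developed in this work, which operates on statistical models of attribute dynamics: here plain Euclidean vectors and squared Euclidean distance are swapped in as simplifying assumptions, and the names `nearest`, `bow_encode`, and `vlad_encode` are hypothetical.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x (squared Euclidean distance)."""
    dists = [sum((c_d - x_d) ** 2 for c_d, x_d in zip(c, x)) for c in codebook]
    return dists.index(min(dists))


def bow_encode(codebook, descriptors):
    """Zeroth-order statistics: count how many descriptors fall on each codeword."""
    hist = [0] * len(codebook)
    for x in descriptors:
        hist[nearest(codebook, x)] += 1
    return hist


def vlad_encode(codebook, descriptors):
    """First-order statistics: sum the residuals (x - codeword) per codeword."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        k = nearest(codebook, x)
        for d in range(dim):
            residuals[k][d] += x[d] - codebook[k][d]
    # Concatenate the per-codeword residual sums into one vector.
    return [v for row in residuals for v in row]


# Toy data (assumed for illustration): two codewords, three snippet descriptors.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
bow = bow_encode(codebook, descriptors)
vlad = vlad_encode(codebook, descriptors)
```

The histogram records only which codewords occur in a video, whereas the residual vector also records how the snippets deviate from each codeword; neither encoding depends on the order of the snippets, which is what makes such encodings tolerant of long-term non-stationarity.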
The proposed framework is implemented via several novel models, together with the corresponding technical tools for statistical inference, parameter estimation, similarity measurement, encoding of statistics on the model manifold, and so on. In particular, a dynamic model, denoted the binary dynamic system (BDS), is proposed to capture the evolution patterns in sequential binary data. It consists of a binary principal component analysis for modeling appearance and a Gauss-Markov process for encoding dynamics. A mixture model is further derived from the BDS to characterize multiple types of dynamics in a large data corpus. Because exact inference is intrinsically intractable, an accurate and efficient approximate inference scheme for the state posterior is developed based on variational methods, and a variational expectation-maximization algorithm is derived for parameter estimation. With these tools, measures that quantify the similarity or dissimilarity of two binary sequences are devised from the perspectives of control theory, information geometry, and kernel methods. In addition, approaches to encode the statistics of sequential binary data on the manifold of statistical models are proposed, resulting in the bag-of-words for attribute dynamics (BoWAD) and the vector of locally aggregated descriptors for attribute dynamics (VLADAD).
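The generative structure described above can be illustrated with a minimal sampling sketch. The abstract does not give the equations, so the form used here is an assumption: a hidden state that evolves as a linear Gauss-Markov process, and binary observations drawn from Bernoulli distributions whose logits are a linear function of the state (the binary-PCA part). The parameter names `A`, `C`, `u`, and `noise_std`, the isotropic noise, and the standard normal initial state are all illustrative choices, not taken from the source.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v_j for m, v_j in zip(row, v)) for row in M]


def sample_bds(A, C, u, T, noise_std=0.1, seed=0):
    """Sample T steps from an assumed BDS-style model.

    Hidden state:  x_{t+1} = A x_t + Gaussian noise   (Gauss-Markov dynamics)
    Observation:   y_t[i] ~ Bernoulli(sigmoid((C x_t + u)[i]))   (binary PCA)
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(len(A))]  # assumed initial state
    states, observations = [], []
    for _ in range(T):
        logits = [c + b for c, b in zip(matvec(C, x), u)]
        y = [1 if rng.random() < sigmoid(l) else 0 for l in logits]
        states.append(x)
        observations.append(y)
        x = [a + rng.gauss(0.0, noise_std) for a in matvec(A, x)]
    return states, observations


# Toy parameters (assumed): 2-D hidden state, 3 binary attributes.
A = [[0.9, -0.2], [0.2, 0.9]]                 # slowly rotating, contracting dynamics
C = [[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]]    # attribute loadings
u = [0.0, 0.0, 0.0]                           # attribute biases
states, observations = sample_bds(A, C, u, T=50)
```

Each row of `observations` is a binary attribute vector for one time step. Inference runs in the opposite direction, recovering the hidden states from the binary sequence, and because the Bernoulli observations are not conjugate to the Gaussian state, the exact posterior has no closed form; this is the intractability that the variational approximation addresses.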
Empirical studies on challenging tasks in complex human activity analysis demonstrate the effectiveness of the proposed framework. Our solution not only produces state-of-the-art results for event detection, but also enables recounting, which provides visual evidence for the prediction anchored over time in the video, and facilitates tasks such as semantic video segmentation, content-based video summarization, and so forth.