The ubiquity of video calls for effective tools that extract content automatically and thereby enable practical applications. Computer vision research aims to bridge the gap between raw data (pixel values) and video semantics, but information based only on image values is not sufficient because of the visual ambiguities caused by varied camera characteristics, frequent occlusions, low resolution, and large intra-class and small inter-class variation among object, activity, and event classes.
In this dissertation, we develop methodologies that use new machine learning and statistical optimization techniques to model high-level context and mitigate visual ambiguity, thus improving performance on several real-world computer vision tasks. We first describe the use of social grouping context, supported by sociology research, to improve intra-camera multi-target tracking, inter-camera multi-target tracking, and head pose estimation in video. For single-camera tracking, social grouping context regularizes existing tracking methods in a principled way and provides a natural way to go beyond traditional tracking with Markovian assumptions. For multi-camera tracking, social grouping context effectively mitigates visual ambiguities that arise from cameras with different viewpoints and lighting conditions. Both problems are unified under a probabilistic formulation, and we provide a novel and effective routine for the resulting constrained nonlinear optimization problem, which jointly conducts tracking and social grouping. We also show that social grouping context helps head pose estimation, which is challenging because head images are small in typical high-angle surveillance videos. A Conditional Random Field (CRF) framework is used to perform group head pose labeling, in which interactions among group members are encoded. The model generalizes existing methods that focus only on individuals, and it allows exact learning and inference.
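To make the idea of group head pose labeling concrete, the following is a minimal sketch of joint labeling with unary and pairwise scores, not the model developed in this dissertation. It assumes a small group, a discrete set of pose labels, and hypothetical score tables (`unary`, `pairwise`), and it finds the best joint labeling by exhaustive enumeration, which is exact but only practical for small groups.

```python
from itertools import product


def map_group_labeling(unary, pairwise):
    """Return the highest-scoring joint labeling and its score.

    unary[i][k]            : score of group member i taking pose label k
    pairwise[(i, j)][a][b] : score of members i and j taking labels a and b
    Both tables are hypothetical inputs used only for illustration.
    """
    n_members = len(unary)
    n_labels = len(unary[0])
    best_score, best_labels = float("-inf"), None
    # Enumerate every joint assignment; exact, but exponential in group size.
    for labels in product(range(n_labels), repeat=n_members):
        score = sum(unary[i][labels[i]] for i in range(n_members))
        score += sum(
            table[labels[i]][labels[j]] for (i, j), table in pairwise.items()
        )
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score


# Two group members, two pose labels (illustrative numbers only).
unary = [[2.0, 0.0], [0.4, 0.5]]
# Assumed interaction: a bonus when both members share the same label.
pairwise = {(0, 1): [[1.0, 0.0], [0.0, 1.0]]}
labels, score = map_group_labeling(unary, pairwise)
```

In this toy example, member 1 would pick label 1 if labeled in isolation, but the pairwise term makes the joint optimum assign label 0 to both members, which is the kind of effect that group-level interactions are meant to capture.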
We further explore temporal context for an important computer vision task, namely video event localization and recognition. We study a recent model from machine learning, the Piecewise-constant Conditional Intensity Model (PCIM), which is able to model complex dependencies in general event streams. We first develop a general-purpose inference algorithm for PCIM by designing an auxiliary Gibbs sampler. The sampler alternates between sampling a finite set of auxiliary virtual events with adaptive rates and performing an efficient forward-backward pass at discrete times to generate samples. We show that our sampler is the first in the literature to successfully perform inference tasks in both Markovian and non-Markovian PCIMs, and that it can be employed in Expectation-Maximization-based parameter estimation and structural learning for PCIM with partially observed data. We then show that the problem of video event localization and recognition can be modeled as the inference of high-level events given low-level observations in a PCIM. Our approach provides a principled way to learn an interpretable model that exploits dependencies among events (both high-level and low-level), whereas existing methods mainly focus on local information. We observe that temporal context helps to mitigate visual ambiguities, especially between events with similar local appearances.
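As an illustration of what "piecewise-constant conditional intensity" means, the sketch below evaluates the log-likelihood of a fully observed event stream for a toy one-label model. It is not the inference algorithm described above. The state function is an assumption made for the example: the rate is `rates[1]` while some event occurred within the preceding `window` time units and `rates[0]` otherwise. With piecewise-constant rates, the log-likelihood reduces to a sum over states of (event count) × log(rate) − (rate) × (time spent in the state).

```python
import math


def sufficient_statistics(event_times, horizon, window):
    """Count events and accumulate durations per state on [0, horizon).

    State 1: an event occurred within the preceding `window` time units.
    State 0: otherwise. This two-state history function is hypothetical.
    Events are assumed to lie in [0, horizon).
    """
    counts = [0, 0]
    durations = [0.0, 0.0]
    t, last_event = 0.0, None
    for e in sorted(event_times) + [horizon]:  # horizon acts as a sentinel
        if last_event is None:
            durations[0] += e - t
        else:
            # State 1 lasts until the window after the last event expires.
            active_end = min(e, last_event + window)
            durations[1] += active_end - t
            durations[0] += e - active_end
        if e < horizon:
            recent = last_event is not None and e - last_event < window
            counts[1 if recent else 0] += 1
            last_event = e
        t = e
    return counts, durations


def log_likelihood(counts, durations, rates):
    """Sum over states of count * log(rate) - rate * duration."""
    return sum(
        c * math.log(r) - r * d for c, d, r in zip(counts, durations, rates)
    )


counts, durations = sufficient_statistics([1.0, 1.5, 5.0], horizon=10.0, window=1.0)
ll = log_likelihood(counts, durations, rates=(0.5, 2.0))
```

Here the event at time 1.5 falls in state 1 because it follows the event at 1.0 within the window, while the other two events fall in state 0. A full PCIM would have one such rate table per event label, with state functions that are learned rather than fixed by hand.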