Activity analysis is a field of computer vision that has shown great progress in the past decade. Starting from simple single-person activities, research in activity recognition is moving towards more complex scenes involving multiple objects and natural environments. The main challenges in the task include localizing and recognizing events in a video and dealing with the large variation in viewpoint, speed of movement, and scale.
Surveillance videos typically consist of wide areas monitored through a static camera. They often contain long-duration sequences of activities which occur at different spatio-temporal locations and can involve multiple people acting simultaneously. Frequently, the activities have contextual relationships with one another. Although context has been studied to some extent for activity recognition in the past, the use of context for recognizing activities in such challenging environments is relatively unexplored. The primary focus of this work is the recognition of activities in continuous videos.
We discuss three methods of activity recognition in continuous videos. In the first, we demonstrate the different components of analysis involved in labeling activities in wide-area continuous videos, such as the elimination of background noise, the identification of motion patterns that correspond to interesting activities, and the task of activity modeling. We propose to do this using an optical-flow-based framework. We discuss the limitations of this work, which can be overcome with the addition of context.
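As a rough illustration of the kind of low-level cue such a framework builds on (a generic sketch, not necessarily the exact formulation used in this work), dense optical flow $(u, v)$ can be estimated under the brightness-constancy constraint, and candidate motion regions obtained by thresholding the flow magnitude:
\[
I_x u + I_y v + I_t = 0, \qquad
\mathcal{R} = \{\, \mathbf{p} : \|(u(\mathbf{p}), v(\mathbf{p}))\| > \tau \,\},
\]
where $I_x$, $I_y$, $I_t$ are the spatial and temporal image derivatives and $\tau$ is a threshold separating interesting motion from background noise.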
Next, we propose a context-based approach for activity recognition using graphical models. We assume that the locations of activities are identified using existing techniques; the task of the graphical model is therefore to label these identified regions using context. Given a collection of videos and a set of weak classifiers for individual activities, the spatio-temporal relationships between activities are represented as probabilistic edge weights in a Markov random field. This model provides a generic representation for an activity sequence that can extend to any number of objects and interactions in a video. We show that the recognition of activities in a video can be posed as an inference problem on the graph. We conduct experiments on the publicly available VIRAT dataset to demonstrate the improvement in recognition accuracy obtained with our proposed model over recognition using state-of-the-art features on individual activity regions.
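In a standard form that such a model can take (the specific potentials are problem-dependent and the notation here is illustrative), the joint labeling $\mathbf{y}$ of the detected activity regions is scored by unary potentials $\phi_i$ derived from the weak classifiers and pairwise potentials $\psi_{ij}$ encoding spatio-temporal context, and recognition amounts to MAP inference:
\[
P(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( -\sum_{i} \phi_i(y_i, \mathbf{x}) - \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j) \Big),
\qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}),
\]
where $\mathcal{E}$ is the set of edges connecting contextually related activity regions.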
We then present a unified framework to track multiple people, as well as to localize and label their activities, in complex long-duration video sequences. To do this, we focus on two aspects: the influence of tracks on the activities performed by the corresponding actors, and the structural relationships across activities. We propose a two-level hierarchical graphical model which learns the relationships between tracks, the relationships between tracks and their corresponding activity segments, and the spatio-temporal relationships across activity segments. Such contextual relationships between tracks and activity segments are exploited at both levels of the hierarchy for increased robustness.
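One possible factorization consistent with this description (the symbols here are illustrative assumptions, not the model's actual notation) couples track evidence, track-activity consistency, and inter-activity context:
\[
P(\mathbf{T}, \mathbf{A} \mid \mathbf{X}) \propto
\prod_{k} \phi(T_k, \mathbf{X}) \;
\prod_{k} \psi(T_k, A_k) \;
\prod_{(k,l) \in \mathcal{E}} \chi(A_k, A_l),
\]
where $T_k$ denotes the $k$-th track, $A_k$ its associated activity segments, and the three factor types correspond to the track level, the cross-level links, and the activity-context level, respectively.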
Finally, we suggest how structure learning can be performed in a graphical model used for activity recognition. While a continuous video consists of several activities, the contextual relationships between these activities are relatively sparse. We propose a method which aims to discover these sparse relationships through L1-regularized automatic structure discovery of a graphical model representing the video. Sparsity is imposed on the edges of the graph so that only a small set of informative relationships is retained.
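A generic form of such an objective (the exact likelihood and parameterization depend on the model, and the notation here is an assumption for illustration) penalizes the parameters attached to each candidate edge with an L1 norm, so that uninformative edges are driven to zero and removed from the graph:
\[
\hat{\Theta} = \arg\max_{\Theta}\; \sum_{v} \log P\big(\mathbf{y}^{(v)} \mid \mathbf{x}^{(v)}; \Theta\big)
\;-\; \lambda \sum_{(i,j)} \big\| \theta_{ij} \big\|_{1},
\]
where the sum over $v$ ranges over training videos, $\theta_{ij}$ are the parameters of the candidate edge between activities $i$ and $j$, and $\lambda$ controls the sparsity of the discovered structure.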