Contextual Visual Recognition from Images and Videos
Object recognition from images and videos has been a topic of great interest in the computer vision community. Its success directly impacts a wide variety of real-world applications, from surveillance and health care to self-driving cars and online shopping.
Objects exhibit organizational structure in their real-world setting (Biederman et al., 1982). Contextual reasoning is part of human visual understanding and has been modeled by various efforts in computer vision in the past (Torralba, 2001). Recently, object recognition has reached a new peak with the help of deep learning. State-of-the-art object recognition systems use convolutional neural networks (CNNs) to classify regions of interest in an image. The visual cues extracted for each region are limited to the content of that region and ignore contextual information from the rest of the scene. The question remains: how can we enhance convolutional neural networks with contextual reasoning to improve recognition?
The work presented in this manuscript shows how contextual cues conditioned on the scene and the object can improve CNNs' ability to recognize difficult, highly contextual objects from images. Turning to the most interesting object of all, people, contextual reasoning is key to the fine-grained tasks of action and attribute recognition. Here, we demonstrate the importance of extracting cues in an instance-specific and category-specific manner tied to the task in question. Finally, we study motion, which captures the change in shape and appearance over time and provides a way to extract dynamic contextual cues. We show that coupling motion with the complementary signal of static visual appearance leads to a very effective representation for action recognition from videos.