Towards Detecting and Describing Objects: Object Detection, Parsing and Human Pose Estimation
- Author(s): Chen, Xianjie
- Advisor(s): Yuille, Alan L
- et al.
Detecting and describing objects is one of the fundamental challenges in computer vision. Teaching computers to find and parse objects in the images is an interesting artificial intelligence problem in its own right, and a working technique also has enormous potential to benefit a lot of other computer vision tasks. In this thesis, we focus on three highly correlated tasks for detecting and describing objects, i.e., object detection, object parsing and human pose estimation, and propose a series of novel methods for these tasks.
The first step to recognize an object is arguably to localize it. We start from studying the role of context for object detection and semantic segmentation in the wild. Towards this goal, we label every pixel of PASCAL VOC 2010 detection challenge with a semantic category, and propose a novel deformable part-based contextual reasoning method. We show that this method significantly helps in detecting objects.
Parsing objects into semantic body parts is important for understanding them further. We propose novel graphical model based approaches to describe objects in terms of the semantic body parts. In order to study different representation learning methods, we also study an end-to-end Deep Convolutional Neural Networks (DCNNs) based method to model the relationship between object and body parts in a holistic manner. For training and evaluating our methods, we provide fully annotated object parts for PASCAL VOC 2010.
Human is one of the most important objects. It is crucial to teach computers to understand different poses of human. We present a method for estimating human pose based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. We make novel use of the DCNNs and combine their statistical power with the representational flexibility of graphical models. To parse humans when there is significant occlusion. We further propose a novel method for learning occlusion cues, and exploit the fact that occlusions often occur in regular patterns. We evaluate these models on popular benchmark datasets and show significant performance improvements over the state of the arts.