One of the fundamental problems in computer vision is to detect and localize objects
such as humans and faces in images. Object detection is a building block for a wide
range of applications, including self-driving cars, robotics, and face recognition. Though
significant progress has been achieved in these tasks, it is still challenging to obtain
robust results in unconstrained images. Real-world scenes usually contain more than one
object, and it is very likely that some parts of an object are occluded by other objects
in the scene. To tackle occlusion, the image features it generates should be explicitly
modeled rather than treated as noise. In this thesis, a deformable part model for detection
and keypoint localization is introduced that explicitly models part occlusion. The proposed
model structure makes it possible to augment positive training data with large numbers of
synthetically occluded instances. This allows us to easily incorporate the statistics of
occlusion patterns in a discriminatively trained model. To exploit bottom-up cues such as
occluding contours and image segments, we extend the proposed model to utilize bottom-up
class-specific segmentation in order to jointly detect and segment out the foreground
pixels belonging to the object.
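To make the augmentation concrete, the following is a minimal sketch, assuming a NumPy-style pipeline; the function name, the box format, and the coverage threshold are hypothetical illustrations, not the implementation used in the thesis. It composites a randomly placed occluder patch over a positive training example and records which parts become occluded, so that occlusion patterns enter the training statistics with explicit labels.

```python
import numpy as np

def occlude_example(image, part_boxes, occluder, rng, occ_thresh=0.5):
    """Paste a synthetic occluder onto a positive example (sketch).

    image:      H x W x 3 array, a positive training example.
    part_boxes: list of (x0, y0, x1, y1) part boxes (hypothetical format).
    occluder:   h x w x 3 patch simulating an occluding object; assumed
                smaller than the image.
    Returns the occluded image and per-part occlusion labels that can
    supervise the occlusion states of the deformable part model.
    """
    H, W = image.shape[:2]
    h, w = occluder.shape[:2]
    # Sample a random placement for the occluding patch.
    y = int(rng.integers(0, max(1, H - h + 1)))
    x = int(rng.integers(0, max(1, W - w + 1)))
    out = image.copy()
    out[y:y + h, x:x + w] = occluder

    # Label a part as occluded when most of its box is covered by the patch.
    occluded = []
    for (x0, y0, x1, y1) in part_boxes:
        ix = max(0, min(x1, x + w) - max(x0, x))
        iy = max(0, min(y1, y + h) - max(y0, y))
        area = max(1, (x1 - x0) * (y1 - y0))
        occluded.append(ix * iy / area > occ_thresh)
    return out, occluded
```

Applying this to each positive example with a random generator (e.g. `rng = np.random.default_rng(0)`) yields arbitrarily many synthetically occluded instances from a fixed pool of annotated positives.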
In these approaches, a detector is trained for a single object category and operates
independently of other detections in the scene. An appealing alternative for
detection in cluttered images is to move from single-object detection to whole-image
parsing. The presence of occlusion can then be “explained away” by the presence of an
occluding object. We model multi-object detection by classifying each pixel of the image
(semantic segmentation) using a convolutional neural network (CNN). CNN architectures achieve
excellent recognition performance but rely on spatial pooling, which makes it difficult to adapt them
to tasks that require dense, pixel-accurate labeling. We demonstrate that while the
apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature
representation contains significant sub-pixel localization information. We describe a
multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip
connections from higher resolution feature maps and multiplicative gating to successively
refine segment boundaries reconstructed from lower-resolution maps. We demonstrate that this
approach yields state-of-the-art semantic segmentation results without resorting to more
complex random-field inference or instance-detection-driven architectures.
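As an illustration of the reconstruction idea, below is a minimal sketch of one refinement stage, assuming a PyTorch-style interface; the module name, channel arguments, and the particular form of the boundary gate are assumptions for exposition, not the exact architecture from the thesis. A coarse class-score map is bilinearly upsampled, and a correction predicted from a higher-resolution skip feature map is applied only near class boundaries via a multiplicative gate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementStage(nn.Module):
    """One Laplacian-pyramid refinement stage (illustrative sketch)."""

    def __init__(self, skip_channels, num_classes):
        super().__init__()
        # Predict per-class boundary corrections from high-res skip features.
        self.correction = nn.Conv2d(skip_channels, num_classes,
                                    kernel_size=3, padding=1)

    def forward(self, coarse_scores, skip_features):
        # Upsample the coarse class scores to the skip feature resolution.
        up = F.interpolate(coarse_scores, size=skip_features.shape[-2:],
                           mode='bilinear', align_corners=False)
        # Multiplicative gate: large where the top two class probabilities
        # are close (near segment boundaries), small in confident interiors.
        probs = torch.softmax(up, dim=1)
        top2 = probs.topk(2, dim=1).values
        gate = 1.0 - (top2[:, 0:1] - top2[:, 1:2])
        # Refined scores: coarse prediction plus gated high-res correction.
        return up + gate * self.correction(skip_features)
```

Stacking one such stage per pyramid level successively doubles the resolution of the predicted segmentation, sharpening boundaries while leaving confidently labeled interior regions untouched.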