UC Irvine Electronic Theses and Dissertations

Recognizing and Segmenting Objects in the Presence of Occlusion and Clutter

Creative Commons Attribution 4.0 (CC BY 4.0) license
Abstract

One of the fundamental problems of computer vision is to detect and localize objects such as humans and faces in images. Object detection is a building block for a wide range of applications, including self-driving cars, robotics, and face recognition. Although significant progress has been achieved on these tasks, it is still challenging to obtain robust results in unconstrained images. Real-world scenes usually contain more than one object, and it is very likely that some parts of an object are occluded by other objects in the scene. To tackle occlusion, the image features generated by occlusion should be explicitly modeled rather than treated as noise. In this thesis, a deformable part model for detection and keypoint localization is introduced that explicitly models part occlusion. The proposed model structure makes it possible to augment the positive training data with large numbers of synthetically occluded instances, which allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. To exploit bottom-up cues such as occluding contours and image segments, we extend the proposed model to utilize bottom-up class-specific segmentation in order to jointly detect and segment out the foreground pixels belonging to the object.
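
As a rough illustration of this synthetic-occlusion augmentation, the sketch below pastes random occluder patches over a subset of annotated part boxes and records a per-part occlusion label for training. The function name, box format, and occluder source are illustrative assumptions, not the code from the thesis.

```python
# Minimal sketch: occlude a random subset of parts in a positive training
# image and return binary per-part occlusion labels. All names here are
# hypothetical, chosen only to illustrate the augmentation idea.
import random
import numpy as np

def synthesize_occlusion(image, part_boxes, occluders, max_occluded=3):
    """image: HxWx3 uint8 array; part_boxes: list of (x, y, w, h);
    occluders: list of HxWx3 patches cropped from unrelated images."""
    img = image.copy()
    labels = [0] * len(part_boxes)
    # Choose how many parts to occlude, then pick them at random.
    k = random.randint(1, min(max_occluded, len(part_boxes)))
    for i in random.sample(range(len(part_boxes)), k):
        x, y, w, h = part_boxes[i]
        patch = random.choice(occluders)
        # Clip the patch so it stays inside the image bounds.
        ph = min(patch.shape[0], img.shape[0] - y)
        pw = min(patch.shape[1], img.shape[1] - x)
        img[y:y + ph, x:x + pw] = patch[:ph, :pw]
        labels[i] = 1  # mark this part as occluded for supervision
    return img, labels
```

Because the occluders are pasted programmatically, the occlusion label for each part is known exactly, so the statistics of occlusion patterns can be learned discriminatively rather than estimated from scarce annotated examples.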

In these approaches, a detector for a single object category is trained and operates independently of other detections in the scene. An appealing alternative for detection in cluttered images is to move from single-object detection to whole-image parsing: the presence of occlusion can then be “explained away” by the presence of an occluding object. We model multi-object detection by classifying each pixel of the image (semantic segmentation) using a convolutional neural network (CNN). CNN architectures have excellent recognition performance but rely on spatial pooling, which makes it difficult to adapt them to tasks that require dense, pixel-accurate labeling. We demonstrate that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization information. We describe a multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher-resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. We demonstrate that this approach yields state-of-the-art semantic segmentation results without resorting to more complex random-field inference or instance-detection-driven architectures.
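
One pyramid level of such a refinement scheme can be pictured with a small sketch: upsample a coarse class-score map, predict a correction from a higher-resolution skip connection, and gate that correction multiplicatively so it acts mainly where the coarse prediction is unreliable, i.e., near segment boundaries. This is a minimal PyTorch approximation under assumed layer shapes and names, not the architecture as implemented in the thesis.

```python
# Hypothetical sketch of one gated Laplacian-pyramid refinement level.
# Layer names, channel counts, and the exact gating form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRefinement(nn.Module):
    def __init__(self, feat_channels, num_classes):
        super().__init__()
        # Skip connection: class scores predicted from high-res features.
        self.score = nn.Conv2d(feat_channels, num_classes, kernel_size=1)
        # Gate: a single channel marking where refinement should apply.
        self.gate = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, coarse_scores, highres_feats):
        # Upsample the coarse prediction to the skip connection's resolution.
        up = F.interpolate(coarse_scores, size=highres_feats.shape[-2:],
                           mode='bilinear', align_corners=False)
        g = torch.sigmoid(self.gate(highres_feats))  # gate values in [0, 1]
        refinement = self.score(highres_feats)
        # Multiplicative gating: add the high-res correction only where the
        # gate fires; elsewhere the upsampled coarse scores pass through.
        return up + g * refinement
```

Stacking one such module per pyramid level, from the coarsest feature map to the finest, successively sharpens the reconstructed segment boundaries without any random-field inference.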
