Object detection and segmentation are important computer vision problems with applications in several domains, such as autonomous driving, virtual and augmented reality systems, and human-computer interaction. In this dissertation, we study how to improve object detection and segmentation by utilizing different contexts. Here, context refers to one of several application scenarios: (i) video frames, for consistent prediction over time; (ii) specific domain knowledge, such as human keypoints for person segmentation; and (iii) implementation context, aiming for efficiency in embedded systems.
Temporal Context of Videos: Video data understanding has drawn considerable interest in recent times as a result of access to huge amounts of video data and the success of image-based models for visual tasks. However, motion blur and compression artifacts cause apparently consistent video signals to produce high temporal variation in frame-level output for vision tasks such as object detection or semantic segmentation. We study and propose efficient early and high-level visual processing algorithms that leverage video content in a streaming fashion. For early processing, we show how to fuse motion and color to achieve improved streaming hierarchical supervoxels. As a high-level visual task, we propose consistent and efficient video object detection using a Convolutional Neural Network (CNN) by clustering video object proposals and propagating object class labels through the videos. Next, we propose an end-to-end framework for learning video object detection through a Recurrent Neural Network (RNN) by posing video as a time series. We also present a post-processing framework for improving semantic segmentation in videos.
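As a rough illustration of propagating class labels through clustered proposals, the sketch below assumes the proposals have already been grouped into clusters across frames and that each proposal carries per-class confidence scores. The function name `propagate_labels`, the dictionary fields, and the score-summing rule are hypothetical choices made for this example, not the method developed in the dissertation.

```python
from collections import defaultdict


def propagate_labels(proposals):
    """Assign every proposal the dominant class of its cluster.

    Each proposal is a dict with the (assumed) keys:
      'frame'   - frame index
      'cluster' - id of the cluster the proposal belongs to
      'scores'  - mapping from class name to per-frame confidence
    """
    # Accumulate class confidences over all proposals in each cluster.
    totals = defaultdict(lambda: defaultdict(float))
    for p in proposals:
        for cls, score in p["scores"].items():
            totals[p["cluster"]][cls] += score

    # The cluster label is the class with the highest summed confidence.
    cluster_label = {c: max(t, key=t.get) for c, t in totals.items()}

    # Propagate that label back to every member proposal.
    return [dict(p, label=cluster_label[p["cluster"]]) for p in proposals]


# One object seen in three frames; the middle frame is degraded (e.g. by
# motion blur), so its per-frame prediction alone would be "cat".
proposals = [
    {"frame": 0, "cluster": 0, "scores": {"dog": 0.9, "cat": 0.1}},
    {"frame": 1, "cluster": 0, "scores": {"dog": 0.4, "cat": 0.6}},
    {"frame": 2, "cluster": 0, "scores": {"dog": 0.8, "cat": 0.2}},
]
labelled = propagate_labels(proposals)
labels = [p["label"] for p in labelled]
```

In this toy case the degraded middle frame inherits the label supported by its neighbours, which is the kind of temporal consistency the paragraph above describes.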
Domain Knowledge Context for Segmentation: Person instance segmentation is a promising research frontier for a range of applications such as human-robot interaction, sports performance analysis, and action recognition. Human keypoints are a well-studied representation of people. We explore how to use keypoint models to improve instance-level person segmentation in constrained and unconstrained environments with or without training.
Efficiency Context for Embedded Implementation: To make an object detection system amenable to embedded implementation, we propose a low-complexity fully convolutional neural network. Additionally, we apply 8-bit quantization to the learned weights. As a mobile use case, we choose face detection. The results show that the proposed method achieves accuracy comparable to state-of-the-art CNN-based object detection methods while reducing the model size by 3x and the memory bandwidth by 3-4x compared with its strongest baseline.
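The idea of 8-bit weight quantization can be sketched as follows. The abstract does not state which quantization scheme is used, so this example assumes a simple symmetric, per-tensor scheme; the function names `quantize_int8` and `dequantize_int8` and the sample weights are illustrative only.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor quantization, chosen for simplicity.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.003, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as a single signed byte plus one shared floating-point scale per tensor, which is where the savings in model size and memory traffic come from; very small weights (such as 0.003 here) collapse to zero, which is the accuracy cost the scheme trades against.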