Learning to Detect and Segment Objects in Images
- Author(s): Hallman, Sam Nathan;
- Advisor(s): Fowlkes, Charless C;
- et al.
Two foundational and long-standing problems in computer vision are to detect and segment objects in images. The former problem focuses on finding certain classes of objects (e.g., humans), while the latter problem is concerned with obtaining a complete, pixelwise labeling of the scene, such that the objects of interest comprise different segments in the final labeling. Despite the obvious similarity between these two problems, they have typically been approached using techniques that have little in common.
In the first half of this thesis, we do away with the artificial distinction between segmentation and detection, and demonstrate the utility of producing segmentations from object detections directly in a feed-forward manner. Because a successful segmentation algorithm must respect bottom-up grouping cues, we begin by introducing a trainable edge detector that advances the state of the art in boundary detection, yet is extremely fast to train. Then, to inject top-down knowledge from object models into the segmentation process, we introduce a simple probabilistic model that captures the shape, appearance and depth ordering of a collection of detections in an image. As an application of this general idea, we release software that detects and segments cell nuclei in 3D confocal microscopy images, using detections as seeds to the segmentation pipeline.
When the underlying object detector is trained on a separate (often larger) dataset, there is usually room to tune it to perform better on new datasets. In the extreme, models sometimes do not generalize at all; this is often the case in applications to biology. To address this problem, we design our cell detector so that it can easily be retrained by an end user by labeling a few test images in order to adapt it to a new dataset. For modern techniques built to work on natural images, the usual way to handle dataset bias is to re-train on or fine tune to the new dataset, but both of these strategies require ground truth labels, which are laborious and often expensive to collect. Instead, we propose unsupervised methods for improving the quality of the underlying detector on new scenes given a large collection of photographs of the given scene, and show a substantial improvement over the baseline pre-trained models.