Beyond Bounding Boxes: Precise Localization of Objects in Images
Object recognition in computer vision comes in many flavors, two of the most popular being object detection and semantic segmentation. Object detection systems detect every instance of a category in an image, and coarsely localize each with a bounding box. Semantic segmentation systems assign category labels to pixels, thus providing pixel-precise localization but failing to resolve individual instances of the category. We argue for a richer output: recognition systems should detect individual instances of a category and provide pixel precise segmentations for each, a task we call Simultaneous Detection and Segmentation or SDS. We describe approaches to this task that leverage convolutional neural networks for precise localization. We also show that the techniques we develop are also effective for other tasks such as segmenting the parts of a detected object or localizing its keypoints. These are our first steps towards a recognition system that goes beyond category labels and coarse bounding boxes to precise, detailed descriptions of objects in images.