In recent years, convolutional networks have dramatically (re)emerged as the dominant paradigm for solving visual recognition problems.
Convnets are effective and appealing machines because they are made of a few simple, efficient building blocks, and are learnable end-to-end with straightforward gradient descent methods.
However, convnets have often been construed as black-box classification machines, which receive whole images as input, produce single labels as output, and leave uninterpretable activations in between.
This work addresses the extension of convnets to rich prediction problems requiring localization along with recognition.
Given their large pooling regions and training from whole-image labels, it has not been clear that classification convnets derive their success from an accurate correspondence model which could be used for precise localization.
In the first part of this work, we conduct an empirical study of convnet features, asking: do networks designed and trained for classification alone contain enough information to understand the local, fine-scale content of images, such as the locations of parts and keypoints?
We find that convnet features are indeed effective for tasks requiring correspondence.
We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass alignment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.
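As a rough illustration of the kind of experiment involved, the following PyTorch-style sketch reads dense descriptors off a pretrained classification network and matches a keypoint between two images by feature similarity. The AlexNet backbone, layer choice, and bilinear upsampling are illustrative assumptions of this sketch, not the exact protocol of the study.
\begin{verbatim}
# Sketch: intermediate convnet activations as dense descriptors for correspondence.
# Backbone, layer, and upsampling choices are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1").features.eval()

def dense_features(image):
    """image: (3, H, W) tensor; returns a (C, H, W) feature map upsampled to image size."""
    with torch.no_grad():
        feat = backbone(image.unsqueeze(0))              # (1, C, h, w) on a coarse grid
    feat = F.interpolate(feat, size=image.shape[1:],
                         mode="bilinear", align_corners=False)
    return F.normalize(feat[0], dim=0)                   # unit-norm descriptor per pixel

def match_keypoint(feat_a, feat_b, y, x):
    """Find the location in image B whose descriptor best matches (y, x) in image A."""
    query = feat_a[:, y, x]                              # (C,)
    sims = torch.einsum("c,chw->hw", query, feat_b)      # cosine-similarity map
    idx = sims.flatten().argmax()
    return divmod(idx.item(), sims.shape[1])             # (y*, x*) in image B
\end{verbatim}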
Encouraged by this positive result, in the second part, we go on to show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, are state-of-the-art semantic segmentation systems.
We achieve this by building ``fully convolutional'' networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
We define and detail the design space and history of \name s, and explain their application to spatially dense prediction tasks.
We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into \name s and transfer their learned representations by fine-tuning to the segmentation task.
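The core of this adaptation is replacing a classifier's fully-connected layers with equivalent convolutions, so that the network accepts inputs of any size and produces a spatial grid of class scores. The following PyTorch-style sketch illustrates the idea on VGG-16; the layer indices and weight reshaping follow the torchvision layout and are assumptions of this example rather than a specification of the full pipeline.
\begin{verbatim}
# Sketch of "convolutionalization": swap fully-connected layers for convolutions
# so the network takes arbitrary-size input and emits a coarse score map.
import torch
import torch.nn as nn
import torchvision

num_classes = 21                                  # e.g. PASCAL VOC: 20 classes + background
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")

head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),          # was fc6: Linear(512*7*7, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),         # was fc7: Linear(4096, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, num_classes, kernel_size=1),  # 1x1 convolution scoring layer
)
# Copy the pretrained fully-connected weights into the new convolution kernels.
head[0].weight.data = vgg.classifier[0].weight.data.view(4096, 512, 7, 7)
head[0].bias.data = vgg.classifier[0].bias.data
head[2].weight.data = vgg.classifier[3].weight.data.view(4096, 4096, 1, 1)
head[2].bias.data = vgg.classifier[3].bias.data

fcn = nn.Sequential(vgg.features, head)
scores = fcn(torch.randn(1, 3, 384, 512))         # arbitrary input size
print(scores.shape)                               # coarse class-score map, e.g. (1, 21, 6, 10)
\end{verbatim}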
We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.
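A simplified sketch of such a skip fusion follows; the pool4/pool5 feature sources, channel counts, and bilinear upsampling (standing in here for learned upsampling) are illustrative assumptions rather than the exact architecture.
\begin{verbatim}
# Sketch of a skip fusion: upsample the coarse, deep prediction and sum it with a
# prediction from a shallower, finer layer, then upsample to image resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipHead(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.score_deep = nn.Conv2d(512, num_classes, kernel_size=1)     # e.g. pool5, stride 32
        self.score_shallow = nn.Conv2d(512, num_classes, kernel_size=1)  # e.g. pool4, stride 16

    def forward(self, pool4, pool5, image_size):
        deep = self.score_deep(pool5)
        shallow = self.score_shallow(pool4)
        # Upsample the coarse scores 2x and fuse with the finer scores by summation.
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = deep_up + shallow
        # Final upsampling back to input resolution gives per-pixel class scores.
        return F.interpolate(fused, size=image_size,
                             mode="bilinear", align_corners=False)

head = SkipHead()
pool4 = torch.randn(1, 512, 32, 32)   # stride-16 features of a 512x512 image
pool5 = torch.randn(1, 512, 16, 16)   # stride-32 features
print(head(pool4, pool5, (512, 512)).shape)   # (1, 21, 512, 512)
\end{verbatim}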
Our FCNs achieve a 20\% relative improvement on PASCAL VOC compared to prior methods, as well as performance improvements on NYUDv2 and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
In the third part, we go beyond pixel labeling to explore generating object regions directly from a convolutional network.
The networks we build produce object segments at multiple scales without intermediate bounding boxes, and without building an image pyramid, instead taking advantage of the natural pyramid of features present in a subsampling network.
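The sketch below only illustrates the general idea of predicting masks from the feature pyramid of a single forward pass, rather than from an image pyramid; every module name and size in it is an assumption, not the architecture developed in this work.
\begin{verbatim}
# Sketch: attach a small mask predictor to feature maps at several depths of one
# subsampling network, so segments are proposed at multiple scales in one pass.
import torch
import torch.nn as nn

class MultiScaleMaskHeads(nn.Module):
    def __init__(self, channels=(128, 256, 512), mask_size=28):
        super().__init__()
        # One lightweight head per feature level; deeper levels cover larger objects.
        self.heads = nn.ModuleList(
            nn.Conv2d(c, mask_size * mask_size, kernel_size=1) for c in channels
        )
        self.mask_size = mask_size

    def forward(self, feature_pyramid):
        proposals = []
        for feat, head in zip(feature_pyramid, self.heads):
            n, _, h, w = feat.shape
            m = head(feat)                                # mask logits at every location
            proposals.append(m.view(n, self.mask_size, self.mask_size, h, w))
        return proposals                                  # one set of masks per scale

heads = MultiScaleMaskHeads()
pyramid = [torch.randn(1, 128, 64, 64),
           torch.randn(1, 256, 32, 32),
           torch.randn(1, 512, 16, 16)]
for masks in heads(pyramid):
    print(masks.shape)   # (1, 28, 28, H_l, W_l) at each level
\end{verbatim}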
We extend the usual notion of convolutional networks, which are indexed spatially, to a notion of pairwise networks, which are doubly indexed spatially.
We describe a wide set of design choices in this space, and relate existing approaches to models of this type.
By carefully examining the structure of learned weights in an existing region-generating network (DeepMask), we see that one of the simplest operations on pairs improves performance at negligible cost.
Our final pure convnet region generator trains and tests in a fraction of a second per image, and produces competitive output on the COCO dataset.