Local and Adaptive Image-to-Image Learning and Inference
Much of the recent progress on visual processing has been driven by deep learning and its bicameral heart of composition and end-to-end optimization. The machinery of convolutional networks is now ubiquitous. Its diffusion, however, was neither instantaneous nor effortless. To advance across the frontiers of vision, deep learning had to be equipped with the right structures: the true, intrinsic structures of the visual world.
This thesis incorporates locality and scale structure into end-to-end learning for visual recognition. Locality structure is key for addressing image-to-image tasks that take image inputs and return image outputs. Scale structure is ubiquitous, and optimizing over it learns the degree of locality for the task and data. Alongside structure, this thesis examines adaptive computation to help cope with the variability of rich image-to-image prediction problems. These directions are studied through the lens of local recognition tasks that require inference of what and where.
Fully convolutional networks decompose image-to-image learning and inference into local scopes. Factorizing these scopes into structured and free-form parts, and learning both, optimizes their size and shape to control the degree of locality. Adaptive computation across time, computing layers according to their rate of change, exploits temporal locality to improve the efficiency of video processing. Adaptive computation across tasks, extracting a latent representation of local supervision, transcends locality to non-locally guide and correct inference. Locality is the defining principle of our fully convolutional networks. Adaptivity equips our networks to more fully engage with the vastness and variety of vision.
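The image-in, image-out character of fully convolutional inference, and its decomposition into local scopes, can be illustrated with a minimal numpy sketch. This is not the thesis's learned architecture: the two fixed 3x3 filters here (an edge-like filter and an averaging filter) are illustrative stand-ins for learned filters, and the two-layer stack merely shows that stacking same-padded convolutions maps an image to a same-sized output map in which each output value depends only on a local window of the input.

```python
import numpy as np

def conv2d_same(image, kernel):
    """2D cross-correlation with zero padding so the output matches the
    input size. Each output pixel is computed from a local window of the
    input: the local scope of fully convolutional inference."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def fcn_forward(image):
    """A two-layer sketch in the fully convolutional style:
    conv -> ReLU -> conv, taking an image input to an image output.
    The filters are hand-picked for illustration, not learned."""
    edge = np.array([[0., -1., 0.],
                     [-1., 4., -1.],
                     [0., -1., 0.]])   # Laplacian-like edge filter
    smooth = np.ones((3, 3)) / 9.0     # averaging filter
    hidden = np.maximum(conv2d_same(image, edge), 0.0)  # ReLU
    return conv2d_same(hidden, smooth)

image = np.random.rand(16, 16)
score_map = fcn_forward(image)
assert score_map.shape == image.shape  # image in, image out
```

Two stacked 3x3 convolutions give each output pixel a 5x5 receptive field, so perturbing an input pixel outside that window leaves the output pixel unchanged; this is the locality that the thesis then learns to control by optimizing the size and shape of the scopes.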