Pixel-level prediction enables visual understanding at a fine granularity, such as segmenting all the people and vehicles in a scene and estimating their 3D shapes and distances from the camera. It is distinct from image-level prediction, which is relatively coarse, for example simply telling an image of a person from that of a car.
Solutions to pixel-level prediction are core to many real-world applications and span a variety of vision tasks, from low-level vision such as image deblurring to mid- and high-level vision such as understanding scene geometry and the 3D shape and motion of every object. The field has advanced greatly over the past decade and continues to foster new challenges and opportunities, owing to the fast development of hardware, the availability of large-scale datasets, and the resurgence of deep convolutional neural networks.
In this thesis, we study pixel-level prediction with new algorithms, innovative model architectures, and novel applications. We begin by training convolutional neural networks (CNNs) for different pixel-level prediction tasks and demonstrate that the CNN acts as a unified framework. Within this framework, we propose novel modules that encode perceptual or cognitive principles, such as 1) objects appear larger when they are closer to the camera, and 2) the cognitive mechanism that allows one to perceive the world with dynamic attention. We show that the proposed modules not only achieve state-of-the-art performance on different tasks but also enable dynamic and parsimonious computation.
Because datasets for per-pixel labeling tasks require painstaking per-pixel annotation, we propose the Predictive Filter Flow (PFF) framework, which trains on simulated images for image reconstruction tasks. PFF generates per-pixel kernels that warp the input towards the output, and thus offers better interpretability with respect to decision making. We further present its multigrid extension (dubbed mgPFF), which trains on unconstrained videos. We show successful applications of mgPFF to visual tracking and flow learning, as well as a unique interactive application for photo editing.
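
To make the idea of per-pixel kernels concrete, the following is a minimal sketch of how a set of such kernels could be applied to an image. It illustrates the warping step only and is not the implementation used in the thesis: the function name `apply_per_pixel_kernels`, the single-channel input, the odd kernel size, and the edge-replicating padding are all assumptions made for this example, and the network that would predict the kernels is not shown.

```python
import numpy as np


def apply_per_pixel_kernels(image, kernels):
    """Warp a single-channel image with one k x k kernel per pixel.

    image:   array of shape (H, W)
    kernels: array of shape (H, W, k, k), with k odd (assumed here)

    Output pixel (i, j) is the weighted sum of the k x k input
    neighbourhood centred at (i, j), using the weights kernels[i, j].
    """
    h, w = image.shape
    k = kernels.shape[-1]
    r = k // 2
    # Assumption: replicate border pixels so every neighbourhood is defined.
    padded = np.pad(image, r, mode="edge")
    out = np.zeros((h, w), dtype=float)
    # Accumulate one kernel tap at a time over the whole image.
    for u in range(k):
        for v in range(k):
            out += kernels[:, :, u, v] * padded[u:u + h, v:v + w]
    return out
```

Under these assumptions, a kernel with all its weight on the centre tap reproduces the input pixel, while a kernel with its weight on an off-centre tap copies a neighbouring pixel. This is what makes the per-pixel kernels readable as a description of where each output pixel comes from.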