A collection of images of a scene captured from different perspectives informs us about the scene's composition. Visualizing these perspectives in intuitive ways using simple techniques, such as light-field visualization and panorama stitching, helps us maximize our perception and understanding of the scene. Beyond letting humans consume image content in a much richer way, such visualizations can also give computer algorithms better inputs from which to reason about the 3D information in the scene. In this thesis, we develop novel techniques to explore both of these threads. To generate image-based visualizations, we use plane-induced homography transformations to warp and compare images.
Plane-induced homography transformations allow us to warp and align multiple images of a scene, taken from different perspectives, to a common 3D reference plane. We combine this simple image-warping operation with knowledge of the camera motion to analyze the characteristics of the flow between the warped images for objects that lie on and off the reference plane. We leverage these observed flow characteristics to develop a computer graphics application that visualizes 3D scenes for human consumption, and a new learning-based framework that solves some core 3D computer vision tasks. More specifically, we investigate the following two application threads:
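As a concrete illustration of the on-plane versus off-plane flow behavior described above, the following Python sketch (using NumPy) builds a plane-induced homography and measures the residual flow left after aligning two views through it. It assumes a standard pinhole camera with intrinsics `K`, view-2 coordinates given by `X2 = R @ X1 + t`, and a reference plane `n^T X = d` expressed in view-1 coordinates, for which the homography is `K (R + t n^T / d) K^-1`. The camera motion (a 1-unit dolly-in with no rotation), the plane depth, and all helper names are hypothetical choices for this example, not the thesis's implementation.

```python
import numpy as np


def plane_induced_homography(K, R, t, n, d):
    """Homography mapping view-1 pixels to view-2 pixels for 3D points on the
    plane n^T X = d (X in view-1 camera coordinates), with X2 = R @ X1 + t."""
    H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
    return H / H[2, 2]  # fix the arbitrary projective scale


def project(K, X):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    x = K @ X
    return x[:2] / x[2]


def warp_point(H, p):
    """Apply a homography to a pixel (u, v)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]


# Hypothetical toy setup: the camera dollies forward by 1 unit, no rotation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, -1.0])          # points are 1 unit closer in view 2
n, d = np.array([0.0, 0.0, 1.0]), 5.0   # fronto-parallel reference plane at Z = 5
H = plane_induced_homography(K, R, t, n, d)


def residual_flow(X1):
    """Flow left over after aligning view 1 to view 2 through the plane."""
    p1 = project(K, X1)
    p2 = project(K, R @ X1 + t)
    return p2 - warp_point(H, p1)


for Z in (2.5, 5.0, 10.0):  # in front of, on, and beyond the plane
    X1 = np.array([1.0, 0.5, Z])
    print(Z, np.round(residual_flow(X1), 3))
```

Running it shows (numerically) zero residual for the point on the plane at Z = 5. In this forward-translation setup, the residual for the nearer point (Z = 2.5) points away from the principal point while the residual for the farther point (Z = 10) points toward it, so the direction of the residual flow reveals which side of the plane a point lies on.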
1) We propose a new framework for visualizing multiple perspectives of a scene and even combining them into a single multi-perspective image by rendering the 3D scene with a multi-perspective camera projection mapping. We showcase this framework through a computational photography application. Taking as input a collection of images captured with a dolly-in camera motion, it allows photographers to change the relative size of imaged objects, change the sense of depth of the scene, and control foreground perspective distortion in the final image composition. This gives photographers intuitive control over post-capture image composition.
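Why a dolly-in capture lets a composition change the relative size of objects follows from the pinhole model: an object of height h at depth Z images with height f * h / Z, so moving the camera forward by t enlarges near objects faster than far ones. The short Python sketch below uses a hypothetical near object (1 m tall at 2 m) and far object (10 m tall at 20 m) and computes their imaged size ratio at a few dolly positions; it only illustrates the underlying geometry and is not the multi-perspective rendering method itself.

```python
def image_height(f, object_height, depth):
    """Pinhole projection: imaged height of an object at the given depth."""
    return f * object_height / depth


def near_far_size_ratio(f, near, far, dolly):
    """Imaged height of `near` divided by that of `far` after the camera moves
    `dolly` units forward along its optical axis. `near` and `far` are
    (object_height, depth) pairs measured from the initial camera position."""
    (h_near, z_near), (h_far, z_far) = near, far
    if dolly >= z_near:
        raise ValueError("camera would pass the near object")
    return (image_height(f, h_near, z_near - dolly)
            / image_height(f, h_far, z_far - dolly))


# Hypothetical scene: 1 m object at 2 m, 10 m object at 20 m, f = 50 (mm).
f = 50.0
near, far = (1.0, 2.0), (10.0, 20.0)
for dolly in (0.0, 0.5, 1.0):
    ratio = near_far_size_ratio(f, near, far, dolly)
    print(f"dolly {dolly} m: near/far size ratio = {ratio:.2f}")
```

The two objects start at the same imaged size (ratio 1.0); after dollying in 0.5 m and 1 m, the near object appears 1.3 and 1.9 times as large as the far one. The focal length cancels in the ratio, which is why zooming alone cannot produce this change.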
2) We propose a new framework that uses multiple visualizations of a pair of images to pose simple questions about the 3D scene to a neural network, which is trained to provide binary answers. Using this framework, we show that core 3D computer vision problems, such as depth estimation for static scenes and time-to-contact and optical-flow estimation for dynamic scenes, can be solved via binary comparisons. Each binary comparison yields meaningful information in just a few milliseconds: Is an object in front of or beyond a given 3D plane in the scene? Is an object's optical flow greater or smaller than a specified threshold? Will an object, static or dynamic, hit the camera plane within a specified time-to-contact? We also show that multiple binary comparisons can be combined to obtain quantized or even continuous estimates when the computational budget allows. Our framework achieves performance competitive with state-of-the-art methods while providing a principled way to trade accuracy for latency at inference time, making it a useful tool for many robotics applications.
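One plausible way to combine binary comparisons into a quantized estimate, and to trade accuracy for latency, is binary search over a sorted set of candidate planes. The Python sketch below assumes a monotone oracle `is_beyond(p)` that answers "is the object beyond the plane at depth p?"; here it is a hypothetical stand-in for the trained network, and the 64 candidate planes and the function name `depth_interval` are illustrative assumptions rather than the thesis's actual scheme. Capping `max_queries` returns a coarser but still valid depth bracket, so the query budget directly controls the resolution.

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def depth_interval(is_beyond: Callable[[float], bool],
                   planes: Sequence[float],
                   max_queries: Optional[int] = None) -> Tuple[float, float, int]:
    """Bracket an object's depth using binary comparisons against sorted planes.

    is_beyond(p) answers "is the object beyond the plane at depth p?" and is
    assumed monotone in p. Returns (lower, upper, queries_used) such that
    lower < depth <= upper; stopping early gives a wider but valid bracket.
    """
    lo, hi, used = 0, len(planes), 0
    while lo < hi and (max_queries is None or used < max_queries):
        mid = (lo + hi) // 2
        used += 1
        if is_beyond(planes[mid]):
            lo = mid + 1   # object lies beyond planes[mid]
        else:
            hi = mid       # object lies at or in front of planes[mid]
    lower = planes[lo - 1] if lo > 0 else 0.0   # assumes depths are positive
    upper = planes[hi] if hi < len(planes) else math.inf
    return lower, upper, used


planes = [float(z) for z in range(1, 65)]  # hypothetical planes at 1, 2, ..., 64
true_depth = 23.4                          # hypothetical ground-truth depth


def oracle(p: float) -> bool:
    """Stand-in for the trained binary classifier."""
    return true_depth > p


print(depth_interval(oracle, planes))
print(depth_interval(oracle, planes, max_queries=3))
```

With all queries allowed, the object at depth 23.4 is bracketed to (23, 24] in six queries; locating any of the 65 possible bins takes at most seven, versus up to 64 for a linear scan over the planes. Stopping after three queries returns the wider bracket (17, 25].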