Reconstruction happens in the human brain every day. When humans observe their surroundings, they effortlessly infer dynamic representations of scene geometry from sequences of images. This higher-dimensional reconstruction not only helps to interpret the input data but also provides an important basis for performing complex, higher-level tasks. Despite its importance, scene reconstruction has long been considered difficult because the information in a 2D input image is insufficient to fully recover 3D scene geometry; solving such an ill-posed task requires prior knowledge. In this thesis, we explore the advantages of machine learning-based techniques for reconstructing geometric information from images and videos. We utilize deep neural networks and probabilistic models and demonstrate their effectiveness in reconstructing geometric information.
In the first part of this thesis, we estimate 2D motion flow from video, leveraging constraints of camera ego-motion and scene geometry. From a sequence of images, we first predict relative ego-motion between the input frames, and then reconstruct the camera trajectory. By enforcing cycle consistency between 2D motion, depth, and camera ego-motion, we train a model to reconstruct scene depth without additional supervision. These processes are trained in a self-supervised, end-to-end convolutional neural network (CNN) architecture with a motion-field-driven photometric consistency loss. To minimize accumulated error from imperfect local estimates, we predict relative reliability scores between every connected pair of input frames and then utilize them in global refinement.
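The core self-supervised signal described above can be illustrated with a toy sketch: warp one frame toward another using a predicted 2D motion field and penalize the photometric difference. This is a minimal numpy illustration of the general idea, not the thesis's actual CNN pipeline; the function names and bilinear-sampling details are my own assumptions.

```python
import numpy as np

def warp_with_flow(img, flow):
    """Warp a grayscale image (H, W) into the target frame using a
    dense 2D flow field (H, W, 2) via bilinear sampling; sampling
    coordinates outside the image are clamped to the border."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xs = np.clip(xs + flow[..., 0], 0, w - 1)
    ys = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def photometric_loss(target, source, flow):
    """Mean absolute photometric error between the target frame and
    the source frame warped by the predicted motion field."""
    return np.mean(np.abs(target - warp_with_flow(source, flow)))
```

In a self-supervised setting, gradients of this loss with respect to the predicted flow (or to the depth and ego-motion that induce it) are what drive training, so no ground-truth depth labels are needed.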
In the second part, we reconstruct a spatio-temporal 4D model from a set of 3D models. We optimize the transformation parameters from each model space to a global space, using a Gaussian mixture to model point observations. We optimize the alignment parameters with the expectation-maximization (EM) algorithm and estimate the temporal extent of each 3D patch by maximizing the expected posterior probability over time. This spatio-temporal model allows us to perform object segmentation as well as infer the existence of occluded objects.
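The EM-based alignment step can be sketched in miniature: given a Gaussian mixture with fixed component means, estimate a transformation (here just a translation, for brevity) that aligns an observed point set to the mixture. This is an illustrative numpy sketch under simplifying assumptions (isotropic fixed variance, translation only), not the full optimization in the thesis, which also handles temporal extents.

```python
import numpy as np

def em_translation(points, means, sigma=0.5, n_iter=20):
    """Estimate a translation t aligning `points` (N, D) to a Gaussian
    mixture with fixed component `means` (K, D) and isotropic variance
    sigma^2, via EM.
    E-step: responsibilities of each component for each shifted point.
    M-step: closed-form t minimizing the responsibility-weighted
    squared distances sum_{n,k} r_nk ||x_n + t - mu_k||^2."""
    t = np.zeros(points.shape[1])
    for _ in range(n_iter):
        shifted = points + t                          # (N, D)
        d2 = ((shifted[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-d2 / (2.0 * sigma ** 2))       # (N, K)
        resp /= resp.sum(axis=1, keepdims=True)       # E-step
        soft_targets = resp @ means                   # (N, D)
        t = (soft_targets - points).mean(axis=0)      # M-step
    return t
```

The same E/M alternation generalizes to rotations and per-patch temporal extents; only the M-step's closed-form update changes.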