In spite of extensive consumer interest in domains such as autonomous driving, general purpose visual perception for autonomy remains a challenging problem. Inertial sensors such as gyroscopes and accelerometers, commonplace on smartphones today, offer complementary capabilities to visual sensors, and robotic systems typically fuse information from both visual and inertial sensors to further enhance their capabilities. In this thesis, we explore models and algorithms for constructing a representation from the fusion of visual and inertial data with the goal of supporting autonomy.
We first analyze the observability properties of the standard model for visual-inertial sensor fusion, which combines Structure-from-Motion (SFM, or Visual SLAM) with inertial navigation, and overturn the common consensus that the model is observable. Informed by this analysis we develop robust inference techniques to enable our real-time visual-inertial navigation system implementation to achieve state-of-the-art relative motion estimation performance in challenging and dynamic environments. From the information provided by this process, we construct a representation of the agent's environment to act as a map that significantly improves location search time relative to standard methods, enabling long-term consistent localization within a constrained environment in real-time using commodity computing hardware. Finally, to allow an autonomous agent to reason about its world at a granularity relevant for interaction, we construct a dense surface representation upon this consistent map and develop algorithms to segment the surfaces into objects of potential relevance to the agent's task.