To interact with the environment, we first need to know where we are relative to it. Observability of the underlying dynamical model is a necessary condition for any algorithm to work, in the sense of yielding a unique point estimate. Our contribution is to show that existing observability analyses were flawed, and to propose a new analysis showing that, contrary to popular belief, pose is not observable from visual and inertial sensors. However, we show that the ambiguous set is bounded, and we compute it analytically.
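To make the terms concrete, here is a minimal formalization of observability and the ambiguous set; the notation is ours, for illustration, and is not taken from the analysis itself.

```latex
% Illustrative notation (ours). State x, known input u, measured output y:
\[
  \dot{x} = f(x, u), \qquad y = h(x)
\]
% Two initial states are indistinguishable when they yield identical
% outputs under every admissible input:
\[
  x_1 \sim x_2 \;\iff\; y(t; x_1, u) = y(t; x_2, u)
  \quad \forall\, t \ge 0,\ \forall\, u
\]
% The ambiguous set of a state is its indistinguishability class:
\[
  A(x) = \{\, x' : x' \sim x \,\}
\]
% Observability (a unique point estimate) means A(x) = \{x\}. The claim
% above is that for visual-inertial sensing A(x) is larger than a single
% point, but bounded and computable in closed form.
```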
Once we know where we are, we need to know what is around us: a problem called “mapping”. Building geometric maps (point clouds) is a well-explored problem. However, to interact intelligently we need more than a point cloud; we need some understanding of topology. How is the world around us divided into “objects”? Chapter 4 describes a way of organizing points into surfaces, and surfaces into connected components that can be considered “objects” for the purpose of interaction, all inferred from video.
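As an illustration of the points-to-surfaces-to-components organization (a minimal sketch, not the actual pipeline of Chapter 4), the code below groups a point cloud by local normal agreement and then labels connected components as candidate “objects”; the function name and all thresholds are ours.

```python
# Sketch only: group a point cloud into "surfaces" by local normal
# agreement, then into connected components that stand in for "objects".
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def segment_objects(points, normals, k=8, max_dist=0.05, max_angle_deg=15.0):
    """points: (N,3) array; normals: (N,3) unit normals (assumed given,
    e.g. from a local plane fit). Returns (#labels, label per point)."""
    tree = cKDTree(points)
    dists, nbrs = tree.query(points, k=k + 1)  # first neighbor is the point itself
    cos_thresh = np.cos(np.deg2rad(max_angle_deg))
    rows, cols = [], []
    for i in range(len(points)):
        for d, j in zip(dists[i, 1:], nbrs[i, 1:]):
            # Connect nearby points whose normals agree: same local surface.
            if d < max_dist and abs(normals[i] @ normals[j]) > cos_thresh:
                rows.append(i)
                cols.append(j)
    n = len(points)
    graph = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # Connected components of the surface graph play the role of "objects".
    return connected_components(graph, directed=False)
```

For instance, two parallel planar patches separated by more than `max_dist` come out as two labels, one per “object”.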
Once we know where we are and have a model of the (visible) environment, with respect to which we know the location of an object of interest (point A), we need to know how to get to point A, which may not itself be visible. This requires exploration, the problem addressed in Chapter 1. The contribution is an efficient algorithm with provable bounds on the exploration time, one that extends to non-compact domains (relevant in vision, because one can see to infinity).
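For context only, the sketch below shows the classic frontier-based exploration step on an occupancy grid, moving toward the nearest known-free cell that borders unknown space; this is the standard heuristic, not the bounded-exploration-time algorithm contributed in Chapter 1.

```python
# Grid cells: 0 = free (seen), 1 = occupied, -1 = unknown.
# Returns the nearest frontier cell: a free cell adjacent to unknown space.
from collections import deque

def nearest_frontier(grid, start):
    rows, cols = len(grid), len(grid[0])
    def neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc
    def is_frontier(r, c):
        return grid[r][c] == 0 and any(grid[rr][cc] == -1
                                       for rr, cc in neighbors(r, c))
    seen, queue = {start}, deque([start])
    while queue:                        # BFS: first frontier found is nearest
        r, c = queue.popleft()
        if is_frontier(r, c):
            return (r, c)
        for rr, cc in neighbors(r, c):
            if (rr, cc) not in seen and grid[rr][cc] == 0:  # move through free space only
                seen.add((rr, cc))
                queue.append((rr, cc))
    return None                         # no frontier left: exploration is done
```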
To explore the boundaries of this problem set, we also ask whether a representation is needed at all, at least for simple problems like going to point A. To this end, we explore the possibility of directly encoding, representing, and optimizing the map from sensory data to control actions, designed so as to achieve the goal (getting to point A).
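As a toy illustration of optimizing the sensing-to-control map directly, with no intermediate representation: the point-mass dynamics, sensor model, and finite-difference optimization below are all our assumptions, not the method developed in the text.

```python
# Toy sketch: a linear policy u = W s maps sensory data straight to
# control, and W is optimized against the goal of reaching point A.
import numpy as np

GOAL = np.array([2.0, 1.0])      # "point A" (illustrative)

def sensor(x, rng):
    # Assumed sensor model: a noisy reading of the displacement to A.
    return GOAL - x + 0.01 * rng.standard_normal(2)

def rollout_cost(W, seed=0, steps=50, dt=0.1):
    # Run the closed loop with fixed noise (common random numbers),
    # so the finite differences below are well defined.
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    for _ in range(steps):
        u = W @ sensor(x, rng)   # control is a direct function of sensing
        x = x + dt * u           # point-mass kinematics
    return float(np.linalg.norm(x - GOAL))

W = np.zeros((2, 2))             # policy parameters: sensing -> control
eps, rate = 1e-3, 0.5
for _ in range(100):
    grad = np.zeros_like(W)
    for i in range(2):
        for j in range(2):
            dW = np.zeros_like(W)
            dW[i, j] = eps
            grad[i, j] = (rollout_cost(W + dW) - rollout_cost(W - dW)) / (2 * eps)
    W -= rate * grad             # descend on the goal-reaching cost
print("distance to A after optimization:", rollout_cost(W))
```

The point of the sketch is only that the policy is the object being optimized: no map is built, and no pose estimate is maintained.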