Computer vision has made significant progress in locating and recognizing objects in recent decades. However, beyond the scope of this “what is where” challenge, it still lacks the ability to understand scenes in the way that characterizes human visual experience. Compared with human vision, what is missing in current computer vision? One answer is that human vision is not only for pattern recognition, but also supports a rich set of commonsense reasoning about object function, scene physics, social intentions, and so on.
I build systems for real-world applications while pursuing a long-term goal of devising a unified framework that can make sense of an image and a scene by reasoning about the functional and physical mechanisms of objects in a 3D world. By bridging advances across the fields of stochastic learning, computer vision, and cognitive science, my research tackles the following challenges:
(i) What is the visual representation? I develop stochastic grammar models to characterize the spatiotemporal structures of visual scenes and events. The analogy to human natural language lays a foundation for representing both visual structure and abstract knowledge. I pose the scene understanding problem as parsing an image into a hierarchical structure of visual entities using the Stochastic Scene Grammar (SSG). With a set of production rules, the grammar enforces both structural regularity and flexibility of visual entities. As a result, the algorithm is able to handle an enormous number of configurations and large geometric variations in both indoor and outdoor scenes.
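To make the idea of production rules concrete, the following is a minimal sketch of a toy stochastic grammar that samples a scene hierarchy top-down. It is an illustration only, not the SSG itself: the rule table `RULES`, its symbols and probabilities, and the helper names `sample_parse` and `leaves` are all assumptions invented for this example.

```python
import random

# Hypothetical toy rules (assumed for illustration, not taken from the SSG).
# Each nonterminal maps to a list of (probability, children) alternatives.
RULES = {
    "Scene": [(0.6, ["Room"]), (0.4, ["Room", "Room"])],
    "Room": [(0.5, ["Wall", "Floor", "Furniture"]), (0.5, ["Wall", "Floor"])],
    "Furniture": [(0.5, ["Table"]), (0.5, ["Table", "Chair"])],
}


def sample_parse(symbol, rng):
    """Expand `symbol` top-down into a nested (label, children) parse tree."""
    if symbol not in RULES:
        # Terminal symbol: a visual entity with no further decomposition.
        return (symbol, [])
    options = RULES[symbol]
    weights = [prob for prob, _ in options]
    # Pick one production rule at random, weighted by its probability.
    _, children = rng.choices(options, weights=weights, k=1)[0]
    return (symbol, [sample_parse(child, rng) for child in children])


def leaves(tree):
    """Return the terminal labels of a parse tree, left to right."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]


if __name__ == "__main__":
    tree = sample_parse("Scene", random.Random(0))
    print(tree)
    print(leaves(tree))
```

The alternatives under each symbol give the flexibility (a room may or may not contain furniture), while the fixed children inside each rule give the regularity (every room has a wall and a floor). Parsing an image runs in the opposite direction: it searches for the tree that best explains the observed entities.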
(ii) How to reason about commonsense knowledge? I augment the grammatical representation with commonsense knowledge about functionality and physical stability. Bottom-up and top-down inference algorithms are designed to find the most plausible interpretation of visual stimuli.
Functionality refers to the practical use for which an object or scene, especially a man-made one, was designed. It goes deeper than geometry and appearance and is therefore a more invariant concept for scene understanding. We present the Stochastic Scene Grammar (SSG) as a hierarchical compositional representation that integrates functionality, geometry, and appearance. This reflects a different philosophy, one that views vision tasks from the perspective of agents: agents (humans, animals, and robots) should perceive objects and scenes by reasoning about their plausible functions.
The physical stability assumption states that objects in a static scene should be stable with respect to the gravity field. In other words, if an object is not stable on its own, it must be either grouped with its neighbors or fixed to its supporting base. We pursue a physically stable interpretation of the scene, expressed as a parse tree, by inferring object stability in the physical world. The assumption applies to general scene categories and thus poses powerful constraints for physically plausible scene interpretation and understanding.
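The stability constraint can be illustrated with a deliberately simplified check. The sketch below assumes a 1D horizontal footprint per object and uniform density, and treats an object as stable when its center of mass projects onto the region where it touches its support. The `Box` class and `is_stable` function are hypothetical names for this example and are not the inference procedure described above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical object footprint along one horizontal axis (assumed uniform density)."""
    x_min: float
    x_max: float

    @property
    def center_x(self) -> float:
        # With uniform density the center of mass is the midpoint of the footprint.
        return 0.5 * (self.x_min + self.x_max)


def is_stable(obj: Box, support: Box) -> bool:
    """True if the object's center of mass lies over its contact region with the support."""
    contact_min = max(obj.x_min, support.x_min)
    contact_max = min(obj.x_max, support.x_max)
    if contact_min >= contact_max:
        # No overlap at all: the object is not resting on this support.
        return False
    return contact_min <= obj.center_x <= contact_max


if __name__ == "__main__":
    table = Box(0.0, 4.0)
    cup = Box(1.0, 2.0)    # fully on the table
    book = Box(3.0, 6.0)   # overhangs the edge; its center lies past the table
    print(is_stable(cup, table), is_stable(book, table))
```

Under this assumption, an object that fails the check cannot be interpreted as a free-standing entity, so a parse that respects stability would have to group it with a neighbor or attach it to its base.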
(iii) How to acquire commonsense knowledge? I performed three case studies to acquire different kinds of commonsense knowledge: I teach the computer to learn affordance from observing human actions, to learn tool use from a single demonstration, and to infer containing relations by physical simulation without an explicit training process. These studies provide some interesting perspectives on how to acquire and exploit commonsense knowledge. In general, the more prediction or simulation is performed, the less training data is needed. As a result, the acquired commonsense knowledge generalizes better to new situations.
Such a sophisticated understanding of 3D scenes enables computer vision to reason about, predict, and interact with the 3D environment, as well as to hold intelligent dialogues beyond the visible spectrum.