Reasoning about commonsense from visual input remains an important and challenging problem in computer vision. It is important because the ability to reason about commonsense, and to plan and act accordingly, represents one of the most distinctive competences that set humans apart from other animals---the ability to reason by analogy. It is challenging partly because one can never observe all the typical examples of a given category: objects often exhibit enormous intra-class variation, leading to long-tail distributions in both appearance and geometry. This dissertation focuses on four largely orthogonal dimensions---functionality, physics, causality, and utility---in computer vision, robotics, and cognitive science, and it makes six major contributions:
We rethink object recognition from the perspective of an agent: how objects are used as ``tools'' or ``containers'' in actions to accomplish a ``task''. Here, a task is defined as changing the physical state of a target object through actions, such as cracking a nut or painting a wall. A tool is a physical object used in a human action to achieve the task, such as a hammer or a brush; it can be any everyday object and is not restricted to conventional hardware tools. This leads us to a new framework---task-oriented object modeling, learning, and recognition, which aims to understand the underlying functionality, physics, and causality of using objects as tools across various task categories.
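Informally, and using illustrative notation rather than the dissertation's exact formulation, task-oriented recognition can be sketched as jointly selecting an object, an imagined pose, and an action that best accomplish the desired change of physical state:
\[
(x^{*},\, p^{*},\, a^{*}) \;=\; \arg\max_{x,\, p,\, a}\; P\!\left(s_{\mathrm{goal}} \mid x,\, p,\, a,\, T\right),
\]
where $x$ is a candidate object, $p$ its pose relative to the target, $a$ the action trajectory, $T$ the task, and $s_{\mathrm{goal}}$ the desired physical state of the target object (e.g., a cracked nut).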
We propose to go beyond visible geometric compatibility to infer, through physics-based simulation, the forces and pressures on various body parts as people interact with objects. By observing people's choices in videos, we can learn the comfort intervals of the pressures on body parts as well as human preferences in distributing these pressures among body parts. Thus, our system is able to ``feel'', in numerical terms, discomfort when the forces/pressures on body parts exceed their comfort intervals. We argue that this is an important step toward representing human utilities---the pleasure and satisfaction, defined in economics and ethics (e.g., by the philosopher Jeremy Bentham), that drive human activities at all levels.
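As a rough illustration (the notation here is ours, not necessarily the dissertation's), let $f_i$ denote the force or pressure on body part $i$ and $[l_i, u_i]$ its learned comfort interval; a simple discomfort measure then penalizes pressures that leave their intervals, weighted by learned preferences $w_i$ for distributing load among body parts:
\[
C(f_1,\dots,f_n) \;=\; \sum_{i=1}^{n} w_i \,\max\!\big(0,\; f_i - u_i,\; l_i - f_i\big).
\]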
We propose to go beyond modeling direct, short-term human interactions with individual objects. By accurately simulating thermodynamics and airflow dynamics, our method can infer the indoor temperature distribution and airflow at arbitrary times and locations, thus establishing a form of indirect and long-term affordance. Unlike chairs in a sitting scenario, the objects (heating/cooling sources) that provide this affordance do not directly interact with a person; instead, the air in the room serves as an invisible medium that passes the affordance from an object to a person. We coin the term intangible affordance for this new form of affordance.
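In its standard form (given here as a generic sketch rather than the particular solver used), the simulated temperature field $T(\mathbf{x}, t)$ obeys an advection--diffusion equation coupled to the air velocity field $\mathbf{u}$:
\[
\frac{\partial T}{\partial t} + (\mathbf{u} \cdot \nabla)\, T \;=\; \alpha \nabla^{2} T + S,
\]
where $\alpha$ is the thermal diffusivity, $S$ accounts for heating/cooling sources, and $\mathbf{u}$ is itself governed by the fluid dynamics of the room air; the resulting field $T(\mathbf{x}, t)$ is what defines the intangible affordance at a given time and location.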
By fusing functionality and affordance into indoor scene generation, we propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and numerous photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms.
We present four case studies on integrating forces and functionality in robotic object manipulation, showcasing the significance and benefits of explicitly modeling functionality in task execution.
We introduce an intuitive substance engine (ISE) model employing probabilistic simulation, which supports the hypothesis that humans infer future states of perceived physical situations by propagating noisy representations forward in time using approximate, rational physics.
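Concretely, probabilistic simulation of this kind can be sketched (again with illustrative notation) as marginalizing over noisy perceived states $s_t$ and uncertain physical properties $\phi$ when predicting a future state $s_{t+\Delta}$ from an observation $o_t$:
\[
P(s_{t+\Delta} \mid o_t) \;=\; \int P(s_{t+\Delta} \mid s_t, \phi)\, P(s_t \mid o_t)\, P(\phi)\; \mathrm{d}s_t\, \mathrm{d}\phi,
\]
which is typically approximated by Monte Carlo: sample perceived states and physical properties, run a deterministic physics engine forward on each sample, and aggregate the simulated outcomes.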