Human scene understanding and event understanding correspond to the spatial and temporal aspects of computer vision, respectively. These abilities serve as the foundation for humans to learn and perform tasks in the world we live in, thus motivating a task-oriented representation for machines to interpret observations of this world.
Toward the goal of task-oriented scene understanding, I begin this thesis by presenting a human-centric scene synthesis algorithm. Realistic synthesis of indoor scenes is more complicated than neatly aligning objects: the scene needs to be functionally plausible, which requires the machine to understand the tasks that could be performed in it.
Instead of directly modeling object-object relationships, the algorithm learns human-object relations and generates scene configurations by imagining the hidden human factors in the scene. I analyze the realism of the synthesized scenes, as well as their usefulness for various computer vision tasks. This framework supports backward inference of 3D scene structures from images in an analysis-by-synthesis fashion; it is also useful for generating data to train various algorithms.
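To make the human-centric formulation concrete, the following is a minimal sketch, not the thesis implementation, of synthesizing part of a scene by first imagining a hidden human and then placing objects relative to that human. All function names, distributions, and parameters are illustrative assumptions:

```python
import math
import random

# Minimal sketch (not the thesis implementation): synthesize part of a scene
# by first imagining a hidden human, then placing objects relative to that
# human. All distributions and parameters below are illustrative assumptions.

def imagine_human(room_size, activity="work_at_desk"):
    """Sample a hidden human factor: position, facing direction, activity."""
    x = random.uniform(0.5, room_size[0] - 0.5)
    y = random.uniform(0.5, room_size[1] - 0.5)
    theta = random.uniform(0.0, 2.0 * math.pi)
    return {"pos": (x, y), "theta": theta, "activity": activity}

def place_object_for_human(human, dist_mean, dist_std):
    """Place an object via a human-object relation: at a noisy preferred
    distance in front of the human, facing back toward the human."""
    d = random.gauss(dist_mean, dist_std)
    hx, hy = human["pos"]
    ox = hx + d * math.cos(human["theta"])
    oy = hy + d * math.sin(human["theta"])
    return {"pos": (ox, oy), "theta": human["theta"] + math.pi}

room = (5.0, 4.0)  # room width and depth in meters
human = imagine_human(room)
desk = place_object_for_human(human, dist_mean=0.6, dist_std=0.05)
chair = place_object_for_human(human, dist_mean=0.05, dist_std=0.02)
print(human, desk, chair)
```

The key design choice is that objects are conditioned on the sampled human pose rather than on each other, so functional plausibility (e.g., a chair facing a desk at sitting distance) emerges from the hidden human factor.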
Moving forward, I introduce a task-oriented event understanding framework for event parsing, event prediction, and task planning. In the computer vision literature, event understanding usually refers to action recognition from videos, i.e., "what is the action of the person?" Task-oriented event understanding goes beyond this definition to uncover the underlying driving forces of other agents. From a planning perspective, it answers questions such as intention recognition ("what is the person trying to achieve?") and intention prediction ("how is the person going to achieve the goal?").
The core of this framework lies in a temporal representation of tasks that is appropriate for humans, for robots, and for the transfer between the two. In particular, inspired by natural language modeling, I represent tasks by stochastic context-free grammars, which are a natural choice for capturing the semantics of tasks. However, traditional grammar parsers (e.g., the Earley parser) only take symbolic sentences as inputs. To overcome this drawback, I generalize the Earley parser to parse sequence data that is neither segmented nor labeled. This generalized Earley parser integrates a grammar parser with a classifier to find the optimal segmentation and labels. It can be used for event parsing and future prediction, as well as for incorporating top-down task planning with bottom-up sensor inputs.
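To illustrate the objective that such a parser optimizes, below is a brute-force sketch under an assumed toy grammar and assumed classifier outputs: it enumerates the label sentences of a small stochastic context-free grammar and, for each, finds the best segmentation of the frame sequence by dynamic programming. The actual generalized Earley parser instead searches the prefix tree of label sentences efficiently using prefix probabilities; all labels, rules, and numbers here are illustrative:

```python
import itertools
import numpy as np

# Brute-force sketch of the joint objective (illustrative, not the efficient
# generalized Earley parser): pick the label sentence and segmentation that
# maximize grammar prior * per-frame classifier likelihood.

# Toy stochastic context-free grammar for a "make tea" task:
# nonterminal -> list of (rule probability, right-hand side).
GRAMMAR = {
    "Task": [(1.0, ["Prepare", "Drink"])],
    "Prepare": [(0.6, ["boil_water", "pour_water"]),
                (0.4, ["pour_water"])],
    "Drink": [(1.0, ["drink"])],
}

def sentences(symbol="Task"):
    """Enumerate (label sentence, derivation probability) pairs."""
    if symbol not in GRAMMAR:                       # terminal action label
        yield [symbol], 1.0
        return
    for rule_p, rhs in GRAMMAR[symbol]:
        for combo in itertools.product(*(list(sentences(s)) for s in rhs)):
            sent = [w for ws, _ in combo for w in ws]
            p = rule_p
            for _, q in combo:
                p *= q
            yield sent, p

def best_segmentation(labels, y, classes):
    """Max over contiguous segmentations of prod_t y[t, label(t)], by DP."""
    idx = [classes.index(l) for l in labels]
    T, m = len(y), len(labels)
    best = np.zeros((T, m))
    best[0, 0] = y[0, idx[0]]
    for t in range(1, T):
        for i in range(m):
            stay = best[t - 1, i]                            # extend segment
            advance = best[t - 1, i - 1] if i > 0 else 0.0   # start next label
            best[t, i] = max(stay, advance) * y[t, idx[i]]
    return best[T - 1, m - 1]               # all labels consumed, in order

classes = ["boil_water", "pour_water", "drink"]
y = np.array([[0.7, 0.2, 0.1],              # assumed classifier output
              [0.6, 0.3, 0.1],              # (rows = frames, cols = classes)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

score, labels = max(((gp * best_segmentation(s, y, classes), s)
                     for s, gp in sentences()), key=lambda st: st[0])
print(score, labels)   # best label sentence under grammar + classifier
```

On this toy input the grammar prior and the frame likelihoods jointly favor the full sentence ["boil_water", "pour_water", "drink"], illustrating how top-down grammatical structure and bottom-up classifier evidence are combined into one score.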