Event understanding is one of the most fundamental problems in artificial intelligence and computer vision. Rooted in neuroscience, the study of human motion perception has long suggested that we perceive human activities as goal-directed behaviors. As an essential human capability, we interpret others’ goals and learn tasks from the endless video stream of daily activities. Endowing machines with the same capability is challenging: it requires generating a detailed understanding of world-model knowledge, including situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism of human multi-tasking, and by partial observations originating both from egocentric perception and from uncertainties in estimating others’ beliefs in multi-agent collaborations.
In this dissertation, we propose to bridge this gap from both the data and the modeling perspectives by incorporating world-model knowledge for proper event parsing, prediction, and reasoning. First, we propose three datasets, RAVEN, LEMMA, and EgoTaskQA, to study the event understanding problem in both abstract and real-world domains. We further devise three benchmarks to evaluate models’ detailed understanding of events: (1) intelligence tests for spatial-temporal reasoning in RAVEN, (2) compositional action recognition and prediction in LEMMA, and (3) task-conditioned question answering in EgoTaskQA. Next, on the modeling side, we decompose event understanding into a unified framework with three essential modules: grounding, inference, and the knowledge base. To solve the problem of detailed event understanding, we need to address (1) the perception problem for grounding, (2) the knowledge representation problem, and (3) the inference problem. For the perception problem, we discuss the potential of existing models and propose BO-QSA for the unsupervised emergence of object-centric concepts. For the inference problem, we discuss ways to instantiate the overall framework with (1) PrAE, which performs probabilistic abduction given logical rules, and (2) GEP, which leverages stochastic context-free grammars for modeling. We conduct experiments to show their effectiveness on various tasks, and discuss the limitations of each proposed work to highlight immediate next steps and possible future directions.