For decades, the goal of computer vision, as framed by Marr, has been to compute what is where by looking. This paradigm guided the geometry-based approaches of the 1980s-1990s and the appearance-based methods of recent years. Despite remarkable progress in recognizing objects, actions, and scenes using large data sets, better-designed features, and machine learning techniques, performance on complex tasks remains far from satisfactory. One example is the first accident caused by Google's self-driving car in February 2016: although the car's 360-degree sensors likely saw the bus coming, the software wrongly assumed that the bus would yield. Some complex computer vision tasks, then, cannot be solved from visible appearance alone.
The goal of this thesis is to seek a bigger picture: to model and reason about the missing dimension, the mind of agents. Borrowing the powerful concept of ``dark matter'' from physics, we call this area ``dark vision''. In this thesis, the mind of agents is inferred jointly in the spatial and temporal domains. The framework includes spatial reasoning over multi-scale space, and temporal reasoning over both the observed story in the past and unseen events in the future.
1) Intention refers to an agent's mind about its future plan. Dark matter corresponds to entities that are infeasible to recognize from visual appearance alone. This includes, but is not limited to, i) the internal states of an agent (human, animal, or robot), such as goals and intents like hunger or thirst, which trigger actions; and ii) attraction relations between an object (e.g., food) and an agent (e.g., a hungry person). Functional objects can therefore be viewed as ``dark matter'', emanating ``dark energy'' that affects people's trajectories in the video. A Bayesian framework probabilistically models people's trajectories and intents, the constraint map of the scene, and the locations of functional objects, as sketched below.
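For concreteness, one plausible factorization of such a joint posterior is sketched here; the symbols ($\Gamma$ for trajectories, $I$ for intents, $M$ for the constraint map, $F$ for functional-object locations, $V$ for the observed video) are chosen for exposition and are not necessarily the thesis's exact notation:
\[
P(\Gamma, I, M, F \mid V) \;\propto\; P(V \mid \Gamma)\, P(\Gamma \mid I, M, F)\, P(I \mid F)\, P(M)\, P(F),
\]
where agents plan trajectories toward attractive functional objects subject to the constraint map, and inference recovers the latent intents, map, and object locations from the observed trajectories.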
2) Attention represents the mind of an agent at the current time. Gaze refers to the location where a person is looking, and attention purpose explains why the person is looking there, e.g., to locate a cup. The method in this thesis computes not only human gaze locations in 3D space but also attention-purpose categories in task-driven actions. Human gaze and attention are decomposed into relations among human skeletons, objects, and gaze directions in the spatiotemporal domain. These relations are represented by a stochastic graph learned via supervised maximum likelihood estimation, as sketched below.
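As an illustrative sketch of the supervised learning step, with $g_i$ denoting an annotated spatiotemporal parse graph of skeleton--object--gaze relations and $\Theta$ the parameters of the stochastic graph, maximum likelihood estimation over $N$ training examples takes the familiar form
\[
\Theta^{*} = \arg\max_{\Theta} \sum_{i=1}^{N} \log P(g_i \mid \Theta).
\]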
3) A further step is to discover invisible relations in group activities. This thesis parses low-resolution aerial videos of large spatial areas in terms of 1) grouping people, 2) recognizing events, and 3) assigning roles to the people engaged in those events. A spatiotemporal And-Or graph framework is proposed to conduct joint inference over these tasks. The thesis also presents a three-layered And-Or graph that jointly models group activities, individual actions, and participating objects; this not only avoids running a multitude of detectors at all spatiotemporal scales but also arrives at a holistically consistent video interpretation. A minimal sketch of the And-Or graph structure follows.
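To make the And-Or graph concrete, the sketch below shows a minimal data structure and a parse-scoring routine; the node names, scores, and example activity are hypothetical stand-ins for the learned models and detectors used in the thesis, not its actual implementation.
\begin{verbatim}
# Minimal And-Or graph sketch (illustrative; names and scores are hypothetical).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    kind: str                       # "and" | "or" | "terminal"
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0              # data term for terminal nodes

def best_parse_score(node: Node) -> float:
    """And-node: compose all children; Or-node: pick the best alternative."""
    if node.kind == "terminal":
        return node.score
    child_scores = [best_parse_score(c) for c in node.children]
    return sum(child_scores) if node.kind == "and" else max(child_scores)

# Example: a group activity (And) composed of individual actions,
# where each person's action is one of several alternatives (Or).
queue = Node("queueing", "and", [
    Node("person_1", "or", [Node("standing", "terminal", score=0.9),
                            Node("walking", "terminal", score=0.4)]),
    Node("person_2", "or", [Node("standing", "terminal", score=0.7),
                            Node("walking", "terminal", score=0.2)]),
])
print(best_parse_score(queue))  # 0.9 + 0.7 from the best alternatives
\end{verbatim}
An And-node composes all of its children (e.g., a group activity composed of individual actions), while an Or-node selects among alternatives; dynamic programming over this recursion is one way to obtain the holistically consistent interpretation that the joint formulation aims for.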
Of course, vision is famously an inverse, ill-posed problem in which only the pixels are observed directly and everything else is hidden or latent. The concept of darkness is orthogonal to, and richer than, the notions of latent or hidden variables used in vision and probabilistic modeling: it is a measure of the relative difficulty of inferring an entity or relation from appearance. The computer vision literature often addresses these dark entities and relations with a lump-sum concept: context. But what is a formal definition of context? How many types of context are there? How is information passed between entities through context? The literature lacks an explicit and principled framework for joint representation and joint inference. This thesis proposes a framework to explore these ``dark'' dimensions of the mind in an explicit manner.