Ma, Xiaojian

A Unified Framework with Benchmarks for Human-like Visual and Relational Reasoning in the Real World

2023

Ma, Xiaojian
Advisor(s): Zhu, Song-Chun

Abstract

Cogito, ergo sum. Building machines that can think and reason like humans is a long-standing goal of AI. Despite the tremendous progress AI has made in recent years, it is still not clear whether these learning machines at scale can solve problems that require sophisticated thinking and reasoning, especially when the problems are also tied to real-world ontologies (entities, relations) and raw sensory observations, i.e., visual and relational reasoning. Further, human-level reasoning and thinking also call for the capability of generalizing what the machine has learned to problem instances with novel forms and combinations. We anticipate that such generalization should be possible even with little data and across diverse modalities, e.g., vision, text, and embodied 3D scenes. This expectation exposes a significant gap between humans and machines.

This dissertation studies human-like visual and relational reasoning in the real world, aiming to close the aforementioned gap between humans and machines. The first part of this dissertation focuses on deepening the current understanding of the limitations of existing ML-based reasoning systems when compared to humans. To this end, a series of benchmarks is developed to examine the full spectrum of anticipated capabilities of these systems, including zero-shot and few-shot generalization and adaptation to difficult modalities such as embodied 3D scenes. Based on these new quests for AI reasoning, thorough evaluations are conducted with recently proposed reasoning systems, and their limitations are discussed.

The second part of this dissertation introduces a unified framework that draws inspiration from the human language system, which is grounded, entity-centric, and semantically rich, and which could be the key to human-level generalization in reasoning. Specifically, the problem of learning language-like representations is investigated from a generative learning perspective. The resulting models facilitate learning object-centric representations from images and discrete-continuous hybrid representations from text using an energy-based formulation. Further, intuitive and scalable inductive biases are developed to leverage the semantic supervision from the English language to learn object-centric and relational representations, and thereby directly tackle the challenging zero-shot systematic generalization problem in visual and relational reasoning. Finally, possible next major steps for the field are highlighted.

UCLA
