eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Exploring Visual Perception with Transformers and World Model Representation

Abstract

This research explores the development of generalized representations in artificial intelligence by leveraging visual attention and world models. The primate visual system processes vast amounts of sensory data through bidirectional visual pathways, utilizing top-down influence, such as visual attention, from high-level cognitive processes to affect early-stage visual processing. Drawing inspiration from this, attention-based visual systems, with the transformer model as the most prominent example, have significantly advanced computer vision by incorporating top-down information, which enables them to exhibit adaptability and versatility when processing a wide range of complex visual tasks. Concurrently, the concept of world models connects visual perception to higher-level cognitive processes and has led to a renaissance in model-based reinforcement learning. Deepening our understanding of visual attention and world models is an essential step towards achieving general artificial intelligence capable of performing a wide range of visual tasks and following complex instructions.

The thesis begins by focusing on designing transformer-based attention mechanisms in visual representation learning for diverse computer vision tasks. First, the research explores the application of these attention mechanisms to develop a geometry perception framework for line segment detection. Next, it presents a novel transformer-based visual system capable of handling multi-scale and contextual information. Furthermore, the thesis highlights the use of attention mechanisms in modeling spatial relationships among object parts in few-shot classification tasks. Lastly, it explores the potential of model-based reinforcement learning algorithms for efficient task transfer, introducing a framework that leverages learned world models to accelerate learning in new and distinct tasks. Through these works, we contribute to ongoing efforts to develop AI systems that approach the flexibility and versatility of the human brain.
