UCLA Electronic Theses and Dissertations

Learning Visually Grounded Intelligence with Language

Abstract

To build an Artificial Intelligence system that can assist us in our daily lives, the ability to understand the world through visual input is essential. Prior studies train visual perception models by defining concept vocabularies and annotating data against a fixed vocabulary. It is difficult to define a comprehensive vocabulary that covers everything, so such models struggle to generalize to novel concepts and domains. In this thesis, I turn to language as a scalable and effective tool for building visually grounded models. Intuitively, natural language is the most effective medium of learning and communication for humans. I introduce two lines of work that train models to understand the visual world with language as supervision. The first line of work is inspired by masked language modeling approaches such as BERT and extends them to build contextualized representation models for vision and language. These models can be fine-tuned to perform vision-language tasks such as answering questions about an image. The second line of work uses language to supervise object detection models and enables object detection with prompts, where users can specify custom needs and domain knowledge in a text prompt, and the model conditions its predictions on that text on the fly.
