Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Structured Models for Vision-and-Language Reasoning


Vision-and-language tasks (such as answering a question about an image, grounding a referring expression, or following a natural language instruction to navigate through a visual environment) require jointly modeling and reasoning over the two modalities of image and text. We have witnessed significant progress in joint visual and linguistic reasoning, often through neural approaches trained with larger datasets and more computational resources. However, is solving these vision-and-language tasks as simple as building models with more parameters and training them on more data? If not, how can we build better reasoning models that are data-efficient and generalize well?

This thesis answers the above questions with structured models for vision-and-language reasoning – models with architectures that take into account the patterns and regularities in human language, visual scenes, and agents’ skills. We begin with the task of referring expression grounding, where we show in Chapter 2 that our proposed Compositional Modular Networks (CMNs) achieve significantly better accuracy and generalization by taking into account the compositional structure of these expressions. We then address the visual question answering task in Chapter 3 with the End-to-End Module Networks (N2NMNs), which are based on dynamic compositional modules that align with the reasoning steps in the questions. In Chapter 4, we extend our work on modular reasoning and propose the Stack Neural Module Networks (SNMNs), which automatically induce a proper module layout with interpretable reasoning steps. Beyond modular reasoning, we also propose to construct context-aware representations of the visual scene for relational reasoning with Language-Conditioned Graph Networks (LCGNs) in Chapter 5, and we address the problem of reading text in images for question answering with iterative pointer-augmented multimodal transformers in Chapter 6. Finally, we show that embodied tasks also require structured models: in Chapter 7, we propose the Speaker-Follower models for the navigational instruction following task, which pair a speaker model with a follower model so that the two complement each other. In all these scenarios, we show that by taking into account the structures in the tasks and the input modalities, our models perform and generalize significantly better than their unstructured counterparts.
