Multimodal Conversation Modeling via Neural Perception, Structure Learning, and Communication
UCLA Electronic Theses and Dissertations

Abstract

Multimodal conversation modeling is an important and challenging problem in building conversational agents. Pioneering works mostly focus on end-to-end multimodal fusion techniques, which require large volumes of paired data and lack interpretability. This dissertation aims at closing the loop of vision-and-language multimodal modeling from the perspectives of neural perception, structure learning, and communication. Specifically, it makes four major contributions:

1. We explicitly model the joint distribution of vision and language as a Gibbs distribution (sketched after this list). We then propose an "analysis by synthesis" cooperative training scheme that uses the learned joint distribution to sample from one modality to another, e.g., category to image or attribute to image. Further, we argue that such a training paradigm can be explained in terms of cognitive dual-process theory: the conditional generator acts as a fast-thinking initializer that provides a rough output, and the sampling process acts as a slow-thinking solver that refines the output with detailed multimodal information.

2. We propose to view a multimodal dialogue as a graph, where each node is a round of dialogue and the edges represent semantic dependencies among dialogue turns. We further propose an Expectation-Maximization (EM)-based algorithm that can both predict partially observed nodes and infer the graph structure. We show that such an unsupervised structure-learning paradigm provides post-hoc interpretability for various multimodal dialogue tasks.

3. We present a crucial but rarely discussed challenge in conversational reasoning: implicature and pragmatics. We show that humans communicate based on their intents and beliefs, from which implicatures commonly arise. To address this gap in the natural language processing community, we propose a dataset-generation protocol based on Spatial-Temporal And-Or Graphs (ST-AOGs). We show that most state-of-the-art language models exhibit a large performance gap compared with humans.

4. We present a human-robot collaboration task, a bomb-defusing game, that requires explanations to help humans understand the machine's behavior. We argue that such explanations should be generated according to the user's mental preferences, i.e., utilities. We therefore propose an explanation-generation algorithm based on a Hidden Markov Model (HMM), which treats the user's mental utilities as a hidden variable that changes with observations. We show that, compared with a rule-based conversational system, our generated explanations are more natural and help gain human trust.
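As a rough sketch of the Gibbs formulation in contribution 1 (the notation below is our own illustration, not taken from the dissertation), the joint distribution of an image v and a piece of language l can be written as an energy-based model:

    p_\theta(v, l) = \frac{1}{Z(\theta)} \exp\big(-E_\theta(v, l)\big),
    \qquad Z(\theta) = \sum_{l} \int \exp\big(-E_\theta(v, l)\big)\, dv

Here E_\theta is a learned energy function scoring the compatibility of the two modalities and Z(\theta) is the (generally intractable) normalizing constant. Under this view, cross-modal generation amounts to conditional sampling, e.g. drawing v from p_\theta(v | l) \propto \exp(-E_\theta(v, l)), which is what allows translation from a category or attribute description to an image.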
