Neural conversational dialogue agents often produce generic, uninteresting responses, such as “Yes” or “I don't know.” While such responses can be appropriate in a variety of contexts, a model that over-produces them makes for a dull conversation. This well-documented phenomenon is known as the diversity problem. This dissertation examines the diversity problem and proposes ways to improve dialogue agents in both the single-response and multi-response settings.
In the single-response setting, the dialogue model is tasked with generating one utterance to continue a conversation, and a model's diversity is measured by its ability to generate varied responses across different conversations. I propose Diversity-Informed Data Collection (DIDC), a data collection procedure aimed at increasing the diversity of a corpus. While prior work modifies decoding procedures to increase model diversity, DIDC addresses the diversity problem at the dataset level: it uses dynamically computed corpus-level statistics to determine which conversational participants to collect more data from. DIDC produces significantly more diverse data than baseline data collection methods, and training dialogue models on the more diverse corpus results in more diverse responses. DIDC is generalizable and can be used with other corpus-level metrics.
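To make the selection step concrete, the sketch below shows one plausible DIDC-style rule under simplifying assumptions: distinct-2 stands in for the corpus-level diversity statistic, and a participant's value is estimated by how much corpus diversity drops when their utterances are removed. The function names, the statistic, and the leave-one-out rule are illustrative rather than the dissertation's exact procedure.

```python
from collections import Counter


def distinct_n(utterances, n=2):
    """Corpus-level distinct-n: unique n-grams divided by total n-grams."""
    ngrams, total = Counter(), 0
    for utt in utterances:
        tokens = utt.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0


def select_participants(corpus_by_participant, top_k=10):
    """Rank participants by how much removing their utterances lowers corpus
    diversity, then return the top_k to collect more data from (illustrative rule)."""
    all_utts = [u for utts in corpus_by_participant.values() for u in utts]
    base = distinct_n(all_utts)
    gains = {}
    for pid in corpus_by_participant:
        others = [u for p, us in corpus_by_participant.items() if p != pid for u in us]
        gains[pid] = base - distinct_n(others)  # positive gain => participant adds diversity
    return sorted(gains, key=gains.get, reverse=True)[:top_k]
```

Because the statistic is recomputed as new conversations arrive, the set of selected participants can shift dynamically over the course of collection, and distinct-2 could be swapped for any other corpus-level metric.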
The next two contributions consider the task of generating multiple responses for a single conversation. Diversity in this setting measures a model's ability to generate multiple varied responses to the same input. First, I propose a novel metric, NLI Diversity, which uses Natural Language Inference (NLI) to measure the semantic diversity of a set of model responses to a conversation. I evaluate this metric using an established framework and find strong evidence that NLI Diversity is correlated with semantic diversity. I show that the contradiction relation is more useful than the neutral relation for measuring this diversity. I additionally demonstrate how to iteratively improve the semantic diversity of a sampled set of model responses via a new generation procedure, Diversity Threshold Generation, which yields higher NLI Diversity than standard generation procedures.
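The sketch below illustrates both ideas under stated assumptions: an off-the-shelf MNLI classifier (roberta-large-mnli) stands in for the NLI model, pairs are scored +1 for contradiction, 0 for neutral, and -1 for entailment, and Diversity Threshold Generation is rendered as a simple re-sampling loop. The `generate` callable, the threshold value, and the swap-one-response rule are hypothetical placeholders, not the dissertation's exact formulation.

```python
import random
from itertools import combinations

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf MNLI model stands in for whichever NLI model is used.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")


def nli_label(premise, hypothesis):
    """Predict the NLI relation (CONTRADICTION / NEUTRAL / ENTAILMENT) for one pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]


# Assumed pair weights: contradictions signal semantic variety,
# entailments signal redundancy, and neutral pairs are left uncounted.
PAIR_SCORE = {"CONTRADICTION": 1, "NEUTRAL": 0, "ENTAILMENT": -1}


def nli_diversity(responses):
    """Sum NLI-relation scores over all ordered pairs in a set of responses."""
    score = 0
    for a, b in combinations(responses, 2):
        score += PAIR_SCORE[nli_label(a, b)] + PAIR_SCORE[nli_label(b, a)]
    return score


def diversity_threshold_generation(generate, context, n=5, threshold=3, max_attempts=10):
    """Re-sample responses until the set's NLI Diversity reaches the threshold."""
    responses = [generate(context) for _ in range(n)]
    attempts = 0
    while nli_diversity(responses) < threshold and attempts < max_attempts:
        responses[random.randrange(n)] = generate(context)  # swap in a fresh sample
        attempts += 1
    return responses
```

Scoring each pair in both directions accounts for the fact that predicted NLI relations can change with the order of premise and hypothesis; weighting contradictions above neutral pairs mirrors the finding that the contradiction relation carries more of the signal for semantic diversity.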
Finally, I hypothesize that some conversations constrain the types of responses that are appropriate, thereby limiting the diversity one would expect in a set of responses. I explore the relationship between the speech acts present in the input conversation and the diversity of a set of output responses, and propose the concept of Pragmatically Appropriate Diversity: the extent to which a conversation creates and constrains the creation of multiple diverse responses. Using a multi-response dataset, I find significant differences in the NLI Diversity of responses to utterances with different speech acts. Building on these findings, I explore whether expert creative writers can predict the Pragmatically Appropriate Diversity of an input conversation, finding significant differences in predicted Pragmatically Appropriate Diversity across speech acts. This contribution provides a framework for incorporating pragmatic conversational information into the evaluation of neural dialogue models.