Conversational AI has seen tremendous progress in recent years, reaching or even surpassing human performance on certain well-defined tasks, such as speech recognition and question answering. Yet it tends to struggle with less constrained tasks, in particular those that involve producing human language. Current approaches to natural language generation (NLG) in dialogue systems still rely heavily on techniques that lack scalability and transferability to different domains, despite the NLG community's general embrace of more robust methods, in particular deep learning (neural) models. Although these neural methods rely on large amounts of annotated data, they tend to produce generic, robotic, and boring responses that lack most of the nuances of human language that make conversation creative and varied.
While the naturalness of the generated language is an important factor in the perceived quality of a dialogue system, semantic accuracy is also extremely important: a system that is not semantically accurate may provide the user with incorrect information or contradict its earlier responses. In this thesis, we focus on the task of generating an utterance from a structured meaning representation (MR). To support our work, we create and release a new parallel corpus with more varied dialogue acts and more conversational utterances than previous MR-to-text corpora. We explore different ways of promoting output diversity in neural data-to-text generation while ensuring high semantic accuracy, developing new methods that help deep learning NLG models produce diverse utterances faithful to their MRs. This is an important step toward making conversational AI more reliable and pleasant to interact with.
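To make the task concrete, the following minimal sketch (in Python) shows what an MR-to-text instance might look like; the dialogue act, slot names, and values are purely illustrative and not drawn from our corpus.

# Illustrative MR-to-text instance; the dialogue act, slots, and values
# below are hypothetical examples, not taken from any particular corpus.
meaning_representation = {
    "dialogue_act": "recommend",
    "slots": {
        "name": "The Green Olive",
        "food": "Mediterranean",
        "price_range": "moderate",
    },
}

# One possible human-written utterance realizing the MR above.
reference_utterance = (
    "If you are in the mood for Mediterranean food, The Green Olive "
    "is a nice, moderately priced place to try."
)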
We first observe in our initial experiments that NLG models can produce more diverse and natural-sounding texts when explicitly prompted to; however, this diversity comes at the expense of semantic accuracy. This leads us to develop a set of methods for automatically assessing and enforcing semantic accuracy in generated utterances. We introduce a general tool that finds a semantic alignment between an utterance and the corresponding input, which can be used to automatically evaluate the accuracy of generated utterances and to rank a pool of candidate utterances produced by a model. We also propose a novel semantically attention-guided decoding method for neural encoder-decoder models, which uses the model's own knowledge acquired during training to track semantic accuracy at inference time and rerank generated utterance candidates accordingly. We show on multiple datasets that both of these methods can dramatically reduce semantic errors in model outputs while maintaining their overall quality and fluency.
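The core idea behind reranking candidates by semantic accuracy can be illustrated with the following minimal sketch, which assumes a naive string-matching alignment between slot values and the utterance; the function names are hypothetical, and our actual alignment tool and attention-guided reranking are considerably more involved.

from typing import Dict, List


def count_slot_errors(mr_slots: Dict[str, str], utterance: str) -> int:
    """Count MR slots whose values are not realized in the utterance.

    This naive check only looks for case-insensitive substring matches;
    a real alignment would also need to handle paraphrased slot values,
    delexicalization, and hallucinated content.
    """
    text = utterance.lower()
    return sum(1 for value in mr_slots.values() if value.lower() not in text)


def rerank_candidates(mr_slots: Dict[str, str],
                      candidates: List[str]) -> List[str]:
    """Order candidate utterances by their number of slot errors."""
    return sorted(candidates, key=lambda utt: count_slot_errors(mr_slots, utt))


# Example: the first candidate omits the price information, so the
# reranking promotes the second one.
slots = {"name": "The Green Olive", "price_range": "moderate"}
candidates = [
    "The Green Olive is a restaurant.",
    "The Green Olive is a moderately priced restaurant.",
]
print(rerank_candidates(slots, candidates)[0])
# -> "The Green Olive is a moderately priced restaurant."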
We then systematically explore Monte Carlo Tree Search (MCTS) as a way to simultaneously optimize semantic accuracy and stylistic diversity during inference. To guide the MCTS, we propose a new referenceless automatic metric for utterance evaluation. Our results show that, using this novel method, we can successfully increase diversity while maintaining, or even improving, semantic accuracy.
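At a high level, the decoding strategy can be sketched as follows; this is a heavily simplified, generic MCTS over token prefixes, where next_token_probs stands in for a trained NLG model and score_utterance for a referenceless quality metric, both of which are assumed interfaces introduced only for illustration rather than the actual implementation described in the thesis.

import math
import random
from typing import Dict, List, Tuple


def mcts_decode(next_token_probs, score_utterance, eos: str = "<eos>",
                simulations: int = 200, top_k: int = 5,
                c_uct: float = 1.0, max_len: int = 40) -> List[str]:
    """Heavily simplified MCTS over token prefixes (illustration only).

    next_token_probs(prefix) -> {token: probability} stands in for a trained
    NLG model, and score_utterance(tokens) -> float for a referenceless
    metric rewarding both semantic accuracy and stylistic diversity.
    """
    children: Dict[Tuple[str, ...], List[Tuple[str, ...]]] = {}
    visits: Dict[Tuple[str, ...], int] = {}
    value: Dict[Tuple[str, ...], float] = {}
    best_tokens, best_score = [eos], float("-inf")

    def uct(parent: Tuple[str, ...], child: Tuple[str, ...]) -> float:
        if visits.get(child, 0) == 0:
            return float("inf")
        exploit = value[child] / visits[child]
        explore = c_uct * math.sqrt(math.log(visits[parent]) / visits[child])
        return exploit + explore

    def rollout(prefix: Tuple[str, ...]) -> List[str]:
        # Complete the prefix by sampling from the model until <eos>.
        tokens = list(prefix)
        while len(tokens) < max_len and (not tokens or tokens[-1] != eos):
            probs = next_token_probs(tokens)
            tokens.append(random.choices(list(probs),
                                         weights=list(probs.values()))[0])
        return tokens

    root: Tuple[str, ...] = ()
    for _ in range(simulations):
        # Selection: descend through already expanded prefixes using UCT.
        node, path = root, [root]
        while children.get(node):
            node = max(children[node], key=lambda ch: uct(path[-1], ch))
            path.append(node)
        # Expansion: add the model's top-k continuations of the prefix.
        if len(node) < max_len and (not node or node[-1] != eos):
            probs = next_token_probs(list(node))
            top = sorted(probs, key=probs.get, reverse=True)[:top_k]
            children[node] = [node + (t,) for t in top]
        # Simulation: complete the prefix and score the full utterance.
        completion = rollout(node)
        reward = score_utterance(completion)
        if reward > best_score:
            best_tokens, best_score = completion, reward
        # Backpropagation: update statistics along the selected path.
        for n in path:
            visits[n] = visits.get(n, 0) + 1
            value[n] = value.get(n, 0.0) + reward

    return best_tokens

The point this sketch is meant to convey is that the search is guided by a score computed on complete utterances, so a referenceless metric, such as the one proposed in this thesis, can steer decoding toward outputs that are both semantically accurate and stylistically varied.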