Question Answering (QA) is an effective way to test the natural language understanding of an artificial intelligence system. Recent advances in model architectures and large-scale datasets have led to neural QA systems that surpass human performance on this task. The success of neural systems stems from their ability to learn, directly from data, the features needed to extract answers. Symbolic systems, in contrast, face notable difficulties in scaling because they apply only to semi-structured or symbol-grounded data. Despite this reliance on structured data, symbolic systems excel at executing deterministic operations and performing reasoning tasks. Conversely, neural systems exhibit limited reasoning ability: they are (1) inconsistent, (2) unable to compose simple facts into complex reasoning, and (3) sensitive to shifts in the domain distribution.
In this dissertation, we present a range of data intervention schemes that help build consistent, decomposable, and generalizable neural QA systems. First, we show that purely neural systems are inconsistent and biased because most training and data collection procedures for neural systems make independence assumptions, and we explore two ways to address this problem. Second, we introduce a compositional QA dataset and show that neural QA methods lack decomposability. We propose a method that uses generated data to break complex questions down into simpler, more manageable sub-questions, improving few-shot performance (a decompose-and-answer loop of this kind is sketched below). Finally, we dissect the complex interactions among questions, answers, and documents learned by a neural QA system, assessing how well they support generalization across a range of data distributions through a series of generated data interventions and dynamic task sampling. Overall, we demonstrate how data interventions can induce characteristics of symbolic systems in neural QA systems.
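To make the decomposition idea concrete, the following is a minimal, hypothetical sketch of answering a compositional question by chaining sub-question answers. It is not the dissertation's implementation: `decompose` and `answer_single_hop` are stand-ins for learned components (e.g., a sub-question generator trained on generated data and a single-hop QA model), stubbed here with canned outputs, and the `#1` placeholder convention is an assumption for illustration.

```python
# Illustrative sketch (not the dissertation's implementation): answer a
# compositional question by decomposing it into simpler sub-questions and
# chaining their answers.

def decompose(question: str) -> list[str]:
    """Hypothetical decomposer: maps a complex question to sub-questions,
    using '#1' as a placeholder for the previous sub-answer."""
    return [
        "Who directed Inception?",
        "What is the birthplace of #1?",
    ]

def answer_single_hop(question: str) -> str:
    """Hypothetical single-hop QA model, stubbed with canned answers."""
    canned = {
        "Who directed Inception?": "Christopher Nolan",
        "What is the birthplace of Christopher Nolan?": "London",
    }
    return canned.get(question, "unknown")

def answer_compositional(question: str) -> str:
    """Answer a multi-hop question by resolving sub-questions in order."""
    answer = ""
    for sub_q in decompose(question):
        sub_q = sub_q.replace("#1", answer)  # substitute previous answer
        answer = answer_single_hop(sub_q)
    return answer

print(answer_compositional("Where was the director of Inception born?"))
# -> London
```

Each intermediate answer feeds the next sub-question, so a model that only handles single-hop questions can, in principle, answer the compositional one.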