With the latest advances in conversational agents such as Siri and Alexa, and Large Language Models (LLMs) such as ChatGPT and PaLM, Question Answering (QA) systems have become increasingly important. Users submit millions of queries per day, and it is up to the system to provide reliable, to-the-point answers. In this dissertation, we explore several directions for improving such QA systems.
First, we tackle the problem of collecting high-quality training data for QA systems, focusing in particular on public frequently asked questions (FAQ) data on the Web. FAQ chatbots rely on good-quality FAQ data, yet no good source of such data is readily available, and collecting it manually is tedious. Given the plethora of question-answer pairs on the Web, there is an opportunity to automatically build large FAQ collections for any domain. Automatically identifying and extracting such high-utility question-answer pairs is a challenging endeavor that has received little research attention. Although identifying general, self-contained FAQs may seem like a straightforward binary classification problem, the limited availability of training data for this task and the sheer number of domains make building machine learning models challenging. We propose QuAX: a framework for automatically extracting high-utility (i.e., general and self-contained) domain-specific FAQ lists from the Web. QuAX receives a set of keywords from a user and works in a pipelined fashion to find relevant web pages and extract general and self-contained question-answer pairs.
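To make the pipelined design concrete, the sketch below outlines the three stages at a high level: page discovery, question-answer extraction, and filtering for general, self-contained pairs. The function names, the stubbed page retrieval, and the keyword-based filter are illustrative stand-ins under our assumptions, not the actual QuAX implementation, which uses real web search, HTML parsing, and a learned classifier.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source_url: str

def find_relevant_pages(keywords):
    """Stage 1: retrieve candidate web pages for the user's keywords.
    A real system would query a search engine API; here we return a stub."""
    return ["https://example.com/faq"]  # placeholder result

def extract_qa_pairs(url):
    """Stage 2: parse the page and pull out question-answer candidates.
    A real system would exploit HTML structure; here we return a stub pair."""
    return [QAPair("What is the refund window?",
                   "Returns are accepted within 30 days.", url)]

def is_general_and_self_contained(pair):
    """Stage 3: keep only pairs that make sense outside their original page.
    Toy heuristic standing in for the binary classifier described above."""
    q = pair.question.lower()
    return "?" in pair.question and not any(w in q for w in ("this page", "here", "above"))

def quax_pipeline(keywords):
    faqs = []
    for url in find_relevant_pages(keywords):
        for pair in extract_qa_pairs(url):
            if is_general_and_self_contained(pair):
                faqs.append(pair)
    return faqs

print(quax_pipeline(["refund", "policy"]))
```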
Second, open-retrieval conversational QA (OrConvQA) systems face the challenge of modeling the history of a user's conversation in order to better answer the latest question. State-of-the-art OrConvQA systems use the same history modeling for all three modules (Retriever, Reranker, Reader) of the pipeline. We hypothesize that this is suboptimal. Specifically, we argue that a broader context is needed in the early modules of the pipeline so that relevant documents are not missed, while a narrower context is needed in the later modules to identify the exact answer span. We propose NORMY, the first unsupervised non-uniform history modeling pipeline that generates the best conversational history for each module. We further propose a novel Retriever for NORMY, which employs keyphrase extraction on the conversation history and leverages passages retrieved in previous turns as additional context.
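The sketch below illustrates the non-uniform idea under simplifying assumptions: the Retriever sees a broad context built from conversation-wide keyphrases plus previously retrieved passages, while the Reader sees only the most recent turns. The frequency-based keyphrase extractor and the fixed reader window are toy stand-ins, not NORMY's actual per-module history selection.

```python
from collections import Counter

def keyphrases(texts, top_k=5):
    """Toy keyphrase extraction: most frequent non-trivial words.
    NORMY would use a proper unsupervised keyphrase extractor here."""
    words = [w.lower().strip("?.,") for t in texts for w in t.split() if len(w) > 3]
    return [w for w, _ in Counter(words).most_common(top_k)]

def retriever_context(history, prev_passages, current_q):
    # Broad context: keyphrases from the whole conversation plus passages
    # retrieved in earlier turns, so relevant documents are not missed.
    return current_q + " " + " ".join(keyphrases(history)) + " " + " ".join(prev_passages)

def reader_context(history, current_q, window=1):
    # Narrow context: only the most recent turn(s), so the reader can
    # pinpoint the exact answer span without distraction.
    return " ".join(history[-window:] + [current_q])

history = ["Who wrote The Hobbit?", "J. R. R. Tolkien wrote The Hobbit."]
prev_passages = ["Tolkien was an English writer and philologist."]
print(retriever_context(history, prev_passages, "When was it published?"))
print(reader_context(history, "When was it published?"))
```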
Third, with the prevalence of powerful LLMs, LLM-based Reranker modules need to process a large number of passages to re-rank them given a query. However, LLM APIs can be very expensive, especially for models such as GPT-4. We propose EcoRank, a budget-constrained LLM-based passage re-ranker that intelligently chooses which passages to spend the budget on, with what prompt strategy, and with which LLM API. We design an LLM cascading pipeline with a mixture of cheaper and more expensive APIs that achieves the best performance within the given budget.
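A minimal sketch of the cascading idea is shown below, assuming two APIs with fixed per-call costs: a cheap model scores as many passages as the budget allows, and the remaining budget is spent re-scoring the most promising candidates with a stronger model. The cost values, the two-stage split, and the mock overlap-based "LLMs" are illustrative assumptions; EcoRank additionally chooses among prompt strategies, which this sketch omits.

```python
def eco_rank(query, passages, budget, cheap_llm, strong_llm,
             cheap_cost=1.0, strong_cost=10.0, keep_top=10):
    # Stage 1: score as many passages as the budget allows with the cheap API.
    spent, scored = 0.0, []
    for p in passages:
        if spent + cheap_cost > budget:
            break
        scored.append((cheap_llm(query, p), p))
        spent += cheap_cost
    scored.sort(key=lambda x: x[0], reverse=True)

    # Stage 2: re-score the most promising passages with the expensive API
    # until the remaining budget runs out; otherwise keep the cheap score.
    refined = []
    for score, p in scored[:keep_top]:
        if spent + strong_cost <= budget:
            score = strong_llm(query, p)
            spent += strong_cost
        refined.append((score, p))
    refined.sort(key=lambda x: x[0], reverse=True)
    return [p for _, p in refined] + [p for _, p in scored[keep_top:]]

# Toy usage with mock "APIs" that score by word overlap with the query.
overlap = lambda q, p: len(set(q.split()) & set(p.split()))
docs = ["capital of France is Paris", "Berlin is in Germany", "Paris hosts the Louvre"]
print(eco_rank("capital of France", docs, budget=15.0,
               cheap_llm=overlap, strong_llm=overlap, keep_top=2))
```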
Fourth, we focus on the Retriever component of the QA system. Retrieval becomes particularly challenging when the document corpus is not available or indexed locally and must be accessed through APIs. For example, legal document retrieval systems such as PACER and LexisNexis charge a fee for retrieving each document. We argue that to improve retrieval accuracy, we need to expand the query by leveraging both feedback from already retrieved relevant documents and LLMs. We propose ProQE, a progressive query expansion algorithm that iteratively expands the query by retrieving documents, evaluating them, and updating the weights of the expanded terms using our novel scoring function.
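The sketch below conveys the progressive loop under stated assumptions: retrieve with the current weighted query, judge each document's relevance (e.g., via an LLM prompt), and adjust term weights before the next round. The additive scoring rule, the round count, and the toy retrieval and relevance callables are hypothetical stand-ins for ProQE's actual scoring function and APIs.

```python
from collections import Counter

def proqe(query, retrieve, judge_relevant, rounds=3, top_terms=5):
    """Progressive query expansion sketch. `retrieve(weighted_terms)` returns
    documents for the current weighted query; `judge_relevant(doc)` is a
    relevance check. Rewarding terms from relevant documents and penalizing
    terms from irrelevant ones is an illustrative stand-in scoring rule."""
    weights = Counter({t.lower(): 1.0 for t in query.split()})
    for _ in range(rounds):
        docs = retrieve(dict(weights))
        for doc in docs:
            delta = 1.0 if judge_relevant(doc) else -0.5
            for term, _ in Counter(doc.lower().split()).most_common(top_terms):
                weights[term] += delta
    return {t: w for t, w in weights.items() if w > 0}

# Toy usage: a fixed document list and a keyword-based relevance judge.
docs = ["court filing fee schedule", "district court docket entry", "recipe for pasta"]
retrieve = lambda weighted_terms: docs
judge_relevant = lambda doc: "court" in doc
print(proqe("court fee", retrieve, judge_relevant, rounds=2))
```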