Due to its strong predictive power, machine learning (ML) has shown considerable potential to transform a wide range of critical domains, such as medicine, healthcare, and finance. Alongside this success, ML models have become more complex and parameter-intensive. For instance, large language models (LLMs) pre-trained on massive amounts of internet text have become a default choice for many prediction problems. As a result, models have become harder to understand and trust, and more data-intensive to build. To address the opaqueness of ML models, researchers have proposed explanation methods that help users understand why models make the predictions they do. Still, explanation methods often do not faithfully explain model predictions, and domain experts struggle to use them. It is therefore important to understand how ML explanations fail, improve their robustness, and enhance their usability. Moreover, because many ML problems are increasingly data-intensive and there is growing demand to integrate ML widely, there is a need for methods that achieve strong predictive performance more easily and cost-effectively.
In this dissertation, we address these problems in two main research thrusts: 1) We evaluate the shortcomings of explanation methods by developing adversarial attacks on such techniques, providing insight into how these methods fail, and we propose novel explanation methods that are more robust to these common failure modes. 2) We develop language-based methods for interacting with explanations, enabling a much broader range of users to understand machine learning models, and we extend these findings to a more general predictive setting, where we use natural language instructions to improve model performance on critical prediction tasks with only minimal training data.
First, we examine the limitations of explanation methods through the lens of adversarial attacks. We introduce adversarial attacks on two commonly used types of explanations: local post hoc explanations and counterfactual explanations. Our methods reveal that it is possible to design ML models for which explanations behave unfaithfully, demonstrating that these explanation techniques are not robust. We additionally analyze other limiting factors of explanations, such as their instability and inconsistency, and demonstrate how improved uncertainty quantification can alleviate these issues. To this end, we introduce two new explanation methods, BayesLIME and BayesSHAP, which equip explanations with uncertainty estimates and overcome many of these robustness issues.
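As a minimal sketch of the idea behind these methods (the perturbation scheme, surrogate choice, and function names below are illustrative assumptions rather than the exact BayesLIME or BayesSHAP procedure), a Bayesian linear surrogate fit to local perturbations yields feature importances together with credible intervals that quantify how certain each importance estimate is:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def local_explanation_with_uncertainty(model_predict, x, n_samples=1000, scale=0.1):
    """Fit a Bayesian linear surrogate around `x` to obtain feature importances
    with credible intervals (illustrative BayesLIME-style sketch)."""
    rng = np.random.default_rng(0)
    # Perturb the instance and query the black-box model on the perturbations.
    X_perturbed = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y = model_predict(X_perturbed)
    # Bayesian linear regression gives a posterior over the surrogate coefficients.
    surrogate = BayesianRidge().fit(X_perturbed - x, y)
    mean = surrogate.coef_
    std = np.sqrt(np.diag(surrogate.sigma_))  # posterior std. dev. of each coefficient
    # Report 95% credible intervals alongside the point-estimate importances.
    return mean, mean - 1.96 * std, mean + 1.96 * std
```

Wide intervals signal that an explanation should not be trusted as-is, which is one way uncertainty quantification addresses the instability and inconsistency issues discussed above.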
Second, we analyze the usability of current explanation methods and find that many subject matter experts, such as healthcare workers and policy researchers, struggle to use them. To overcome these issues, we introduce TalkToModel: an interactive, natural language dialogue system for explaining ML models. Our real-world evaluations suggest that TalkToModel substantially improves the usability of ML explanations. Building on the finding that natural language is a highly effective interface between models and humans, we evaluate how well current LLMs follow natural language instructions to solve tabular prediction tasks and introduce TABLET, a benchmark of such tasks. Taken together, these works offer new techniques for making ML models more accessible to end users through natural language.
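As a minimal sketch of this instruction-based setup (the prompt format, feature names, and the `llm` callable are illustrative assumptions, not TABLET's actual specification), a tabular example can be serialized alongside a natural language task description and passed to an LLM:

```python
def serialize_row(row: dict) -> str:
    # Render a tabular example as a readable "feature: value" listing.
    return "\n".join(f"- {name}: {value}" for name, value in row.items())


def build_instruction_prompt(task_instruction: str, row: dict, labels: list) -> str:
    # Combine the natural language task description with one serialized example.
    return (
        f"{task_instruction}\n\n"
        f"Features:\n{serialize_row(row)}\n\n"
        f"Answer with one of: {', '.join(labels)}."
    )


# Hypothetical usage: `llm` stands in for any text-generation API.
instruction = ("You will be given a patient's measurements. "
               "Predict whether the patient is at risk of heart disease.")
row = {"age": 63, "resting blood pressure": 145, "cholesterol": 233, "max heart rate": 150}
prompt = build_instruction_prompt(instruction, row, labels=["at risk", "not at risk"])
# prediction = llm(prompt)  # expected to return one of the label strings
```

Because the task is specified in language rather than learned solely from labeled examples, this style of prediction can require little task-specific training data.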