eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations

On the Robustness of Robustness and Counterfactual Bias Evaluation

Abstract

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations themselves robust? In other words, if a model is robust or unbiased on a test set, do these properties still hold on a slightly perturbed test set? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses how the prediction changes under a single-word substitution. We apply this framework to study two perturbation-based approaches used to analyze model robustness and counterfactual bias. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples whose predictions can be altered. Our proposed attack attains high success rates (96.0%–99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting protected tokens (e.g., gender, race) and measure the shift in the expected prediction. In our experiments, the method reveals hidden model bias even when the test set is adversarially chosen.
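
The abstract describes the framework only at a high level. As a rough illustration of the robustness part (not the paper's implementation), the sketch below uses a hypothetical synonym table and a stand-in keyword classifier: the first perturbation level enumerates single-word neighbors of a test sentence, and the second level flags any neighbor whose prediction flips under one further synonym substitution, i.e., the kind of vulnerable example the analysis looks for. The real framework constructs natural first-level perturbations with far more care and attacks trained CNNs and Transformers; everything named here is a toy placeholder.

# Minimal sketch of the double-perturbation idea (hypothetical stand-ins throughout).

# Toy symmetric synonym table; the paper's framework generates natural
# first-level perturbations rather than reusing a fixed word list.
SYNONYM_SETS = [
    {"movie", "film", "picture"},
    {"great", "excellent", "fantastic"},
]
SYNONYMS = {w: sorted(group - {w}) for group in SYNONYM_SETS for w in group}

def toy_classifier(sentence):
    """Stand-in binary sentiment classifier: 1 = positive, 0 = negative."""
    positive = {"great", "excellent"}
    return 1 if any(w in positive for w in sentence.lower().split()) else 0

def single_word_neighbors(sentence):
    """Yield sentences differing from `sentence` by exactly one word."""
    words = sentence.split()
    for i, w in enumerate(words):
        for alt in SYNONYMS.get(w.lower(), []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])

def find_vulnerable_examples(test_sentence, predict):
    """First level: perturb the test sentence into nearby sentences.
    Second level: flag any neighbor whose prediction flips under one
    additional single-word (synonym) substitution."""
    original_label = predict(test_sentence)
    vulnerable = []
    for neighbor in single_word_neighbors(test_sentence):
        base_label = predict(neighbor)
        for candidate in single_word_neighbors(neighbor):
            if predict(candidate) != base_label:
                vulnerable.append((neighbor, candidate))
                break
    return original_label, vulnerable

if __name__ == "__main__":
    label, found = find_vulnerable_examples("the movie was great", toy_classifier)
    print("original prediction:", label)
    for neighbor, flipped in found:
        print(f"vulnerable neighbor: {neighbor!r} -> flipped by {flipped!r}")

Running the sketch shows that even when the original sentence keeps its label, several nearby sentences can be flipped by a single substitution, which is the weakness "beyond the test dataset" the framework is designed to surface. The counterfactual-bias variant of the same idea substitutes protected tokens instead of synonyms and tracks the shift in the expected prediction over the constructed neighborhood.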
