Recent advances in machine learning have led to increased deployment of
black-box classifiers across a wide variety of applications. In many such
situations there is a critical need both to reliably assess the performance of
these pre-trained models and to do so in a label-efficient manner, since labels
may be scarce and costly to collect. In this paper, we introduce an active
Bayesian approach to classifier performance assessment that satisfies both
desiderata: reliability and label-efficiency.
We begin by developing inference strategies to quantify uncertainty for common
assessment metrics such as accuracy, misclassification cost, and calibration
error. We then propose a general framework for active Bayesian assessment that
uses the inferred uncertainty to guide efficient selection of instances for
labeling, enabling better performance assessment with fewer labels (both steps
are sketched below). We demonstrate
significant gains from our proposed active Bayesian approach via a series of
systematic empirical experiments assessing the performance of modern neural
classifiers (e.g., ResNet and BERT) on several standard image and text
classification datasets.
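
As a concrete illustration of the first step above (quantifying uncertainty for an assessment metric), the sketch below places a Beta prior on a classifier's accuracy and updates it with labeled correctness outcomes. This is a minimal example of Bayesian uncertainty quantification for accuracy, not necessarily the exact model used in the paper; the uniform Beta(1, 1) prior and the toy data are assumptions made purely for the example.

```python
import numpy as np
from scipy import stats

def accuracy_posterior(correct_flags, alpha=1.0, beta=1.0):
    """Beta posterior over accuracy given 0/1 correctness outcomes for labeled instances.

    A Beta(alpha, beta) prior is assumed; each labeled instance contributes a
    Bernoulli outcome (1 = classifier was correct, 0 = incorrect).
    """
    k = int(np.sum(correct_flags))              # number of correctly classified instances
    n = len(correct_flags)
    return stats.beta(alpha + k, beta + n - k)  # conjugate Beta update

# Toy example: the classifier was correct on 8 of 10 labeled instances.
correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
post = accuracy_posterior(correct)
lo, hi = post.interval(0.95)
print(f"posterior mean accuracy = {post.mean():.3f}, "
      f"95% credible interval = ({lo:.3f}, {hi:.3f})")
```

The width of the credible interval communicates how reliable the current accuracy estimate is, which is the kind of uncertainty an active strategy can exploit.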
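The second step (using the inferred uncertainty to decide where to spend the labeling budget) can be illustrated with a Thompson-sampling-style loop over groups of instances, e.g., instances sharing a predicted class. This is a hedged sketch of one plausible active-assessment strategy, not a statement of the paper's exact algorithm; the group structure, the Beta(1, 1) priors, the simulated labeling oracle, and the "label the apparently least accurate group" objective are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_groups = 5
true_acc = np.array([0.95, 0.90, 0.80, 0.70, 0.60])  # hypothetical per-group accuracies
alpha = np.ones(n_groups)                            # Beta(1, 1) prior per group (assumption)
beta = np.ones(n_groups)

def oracle_label(g):
    """Stand-in for a human labeler: returns 1 if the classifier is correct on a
    random instance drawn from group g, else 0 (simulated for this example)."""
    return int(rng.random() < true_acc[g])

for _ in range(200):                   # labeling budget of 200 queries
    sampled = rng.beta(alpha, beta)    # one posterior draw of accuracy per group
    g = int(np.argmin(sampled))        # query the apparently least accurate group
    y = oracle_label(g)                # acquire one label there
    alpha[g] += y                      # conjugate posterior update
    beta[g] += 1 - y

labels_spent = (alpha + beta - 2).astype(int)
print("labels per group:", labels_spent)
print("posterior mean accuracy per group:", np.round(alpha / (alpha + beta), 3))
```

Because the posterior draws concentrate as evidence accumulates, the loop automatically shifts labels toward groups whose performance is still uncertain or apparently poor, which is the source of the label-efficiency gains described above.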