Over the last decade, machine learning practitioners in fields like computer vision and natural language processing have devoted vast resources to building models that successively improve performance numbers on a small number of prominent benchmarks. While performance on these benchmarks has steadily increased, real-world deployments of learning systems continue to encounter difficulties with robustness and reliability. The contrast between the optimistic picture of progress painted by benchmark results and the challenges encountered by real systems calls into question the validity of benchmark datasets, that is, the extent to which benchmark findings generalize to new settings. In this thesis, we probe the validity of machine learning benchmarks from several perspectives.
We first consider the statistical validity of machine learning benchmarks. Folk wisdom in machine learning says that repeatedly reusing the same dataset for evaluation invalidates standard statistical guarantees and can lead to overoptimistic estimates of performance. We test this hypothesis via a dataset reconstruction experiment for the Stanford Question Answering Dataset (SQuAD). We find no evidence of overfitting from test-set reuse. This result is consistent with a growing literature that finds no evidence of so-called adaptive overfitting in benchmarks using image and tabular data. We offer a new explanation for this phenomenon based on the observed similarity between models being evaluated, and we formally show that this type of model similarity offers improved protection against overfitting.
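As a toy illustration of why model similarity limits overfitting from test-set reuse (a simplified simulation, not the thesis's actual experiment), the sketch below scores many chance-level models on one shared test set. The `similarity` parameter, the fraction of test examples on which all models make the same prediction, is a hypothetical construct for this example; higher similarity shrinks the gap between the best observed score and the typical score, which is the quantity an adaptive analyst can exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models = 2000, 100

def overfitting_gap(similarity):
    """Gap between the best observed test accuracy and the average accuracy
    when many chance-level models are scored on one reused test set.
    `similarity` is the fraction of test examples on which every model gives
    the same shared prediction, so higher values mean more similar models."""
    shared = rng.random(n_test) < 0.5            # shared correctness pattern
    is_shared = rng.random(n_test) < similarity  # examples where models agree
    accs = []
    for _ in range(n_models):
        own = rng.random(n_test) < 0.5           # model-specific correctness
        correct = np.where(is_shared, shared, own)
        accs.append(correct.mean())
    accs = np.array(accs)
    return accs.max() - accs.mean()

# Independent models leave a large gap to exploit; highly similar
# models leave a much smaller one.
print(overfitting_gap(0.0), overfitting_gap(0.9))
```

Intuitively, similar models behave like fewer effective "queries" to the test set, so the maximum over observed scores concentrates more tightly around the truth.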
While statistical validity appears to be less of a concern, our experiments on SQuAD reveal that predictive performance estimates are extremely sensitive to small changes in the distribution of test examples, which threatens the external validity of such benchmarks. To understand the breadth of this issue, we conduct a large-scale empirical study of more than 100,000 models across 60 different distribution shifts in computer vision and natural language processing. Across these many distribution shifts, we observe a common phenomenon: small changes in the data distribution lead to large and uniform performance drops across models. Moreover, this drop is often governed by a precise linear relationship between performance on the benchmark and performance on new data that holds across model architectures, training procedures, and dataset sizes. Consequently, sensitivity to distribution shift is likely an intrinsic property of existing benchmark datasets and not something that is easily addressed by algorithmic or modeling innovations.
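The linear relationship described above can be made concrete with a small sketch: fit an ordinary least-squares line to paired benchmark and shifted-distribution accuracies across models. The numbers below are hypothetical and exactly linear by construction, purely to show the computation; they are not drawn from the study.

```python
import numpy as np

# Hypothetical paired accuracies for six models: accuracy on the original
# benchmark vs. accuracy on a shifted test set (illustrative numbers only).
benchmark_acc = np.array([0.70, 0.75, 0.80, 0.85, 0.90, 0.95])
shifted_acc = np.array([0.52, 0.58, 0.64, 0.70, 0.76, 0.82])

# Fit shifted = a * benchmark + b by least squares.
a, b = np.polyfit(benchmark_acc, shifted_acc, deg=1)

# Residuals measure how tightly models cluster around the line; a near-zero
# spread is what "a precise linear relationship" means in practice.
residuals = shifted_acc - (a * benchmark_acc + b)
print(round(a, 3), round(b, 3), float(np.max(np.abs(residuals))))
```

In the empirical study, real models cluster around such a line despite differing architectures and training procedures, which is why the drop under shift cannot be escaped simply by picking a different model.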
Taken together, these results highlight the difficulties of using narrow, static benchmarks to build and evaluate systems deployed in a dynamic world. In the final part of the thesis, we present two new resources to improve the evaluation of such systems. In the context of algorithmic fairness, we present a new collection of datasets derived from US Census data that explicitly includes data across multiple years and all US states. This allows researchers to evaluate new models and algorithms in the presence of population changes due to temporal shift and geographic variation. In the context of causal inference, we introduce a simulation framework that repurposes dynamical system models from climate science, economics, and epidemiology for the evaluation of causal inference tools across a variety of data-generating distributions, both when the assumptions of such tools are satisfied and when they are not.