Beyond Worst-Case Generalization in Modern Machine Learning
 Theisen, Ryan Christopher
 Advisor(s): Mahoney, Michael W
Abstract
This thesis is concerned with the topic of generalization in large, overparameterized machine learning systems: that is, how models with many more parameters than the number of examples they are trained on can perform well on new, unseen data. Such systems have become ubiquitous in many modern technologies, achieving unprecedented success across a wide range of domains. Yet exactly why modern machine learning models work so well has eluded a comprehensive understanding. Classical approaches to this question have focused on characterizing how models will perform in the worst case. However, recent findings strongly suggest that such an approach is unlikely to yield the desired insight. This thesis furthers our understanding of this phenomenon, with an eye towards moving beyond the worst case. The contents of this thesis are divided into six chapters.
In Chapter 1, we introduce the problem of generalization in machine learning, and briefly overview some of the approaches, both recent and classical, that have been taken to understand it.
Chapter 2 introduces a novel analysis of deep neural networks with positive homogeneous activation functions. We develop a method to provably sparsify and quantize the parameters of a model by sampling paths through the network. This directly leads to a covering of this class of neural networks, whose size grows with a measure of complexity we call the "path norm". Using standard techniques, we can derive new worst-case generalization bounds that improve on previous results appearing in the literature.
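To give a concrete (and deliberately simplified) picture of the complexity measure involved, the sketch below computes an ℓ1-style path norm for a hypothetical two-layer network: the sum, over all input-to-output paths, of the product of absolute weights along the path. The weights are random placeholders, and the precise definition used in the chapter may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # first layer: 8 inputs -> 16 hidden units
W2 = rng.normal(size=(3, 16))   # second layer: 16 hidden units -> 3 outputs

# Path norm: sum over all input-to-output paths of the product of absolute
# weights along the path. For a two-layer network this collapses into a
# single product of absolute-valued weight matrices.
path_norm = np.ones(3) @ np.abs(W2) @ np.abs(W1) @ np.ones(8)
```

The matrix-product form is equivalent to enumerating every path explicitly, but runs in time linear in the number of weights rather than the (exponentially larger) number of paths.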
In Chapter 3, we take a critical look at the worst-case approach to understanding generalization. To do this, we develop a methodology to compute the full distribution of test errors for interpolating linear classifiers on real-world datasets, and compare this distribution to the performance of the worst-case classifier on the same tasks. We consistently find that, while truly poor, worst-case classifiers indeed exist for these tasks, they are exceedingly rare — so much so that we expect to essentially never encounter them in practice. Moreover, we observe that as models become larger, test errors concentrate around a critical threshold, with almost all classifiers achieving nearly the same error rate. These results suggest that the worst-case approach to generalization is unlikely to describe the practical performance of large, overparameterized models, and that new approaches are needed.
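The idea of a *distribution* over interpolating classifiers can be illustrated with a toy linear setup: when there are more parameters than training points, every solution of the training equations fits the data exactly, and we can sample many of them by perturbing the minimum-norm solution within the null space of the data matrix. This is only a conceptual sketch with synthetic Gaussian data; the chapter's methodology for real-world datasets is more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                        # fewer samples than dimensions: overparameterized
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

# Minimum-norm interpolator; adding any null-space direction of X yields
# another classifier that fits the training data exactly.
w0, *_ = np.linalg.lstsq(X, y, rcond=None)
_, _, Vt = np.linalg.svd(X)
N = Vt[n:].T                         # basis for the null space (d - n columns)

X_test = rng.normal(size=(1000, d))
y_test = np.sign(X_test @ w_true)

errors = []
for _ in range(200):
    w = w0 + N @ rng.normal(size=d - n)   # a random interpolating classifier
    errors.append(np.mean(np.sign(X_test @ w) != y_test))
```

Plotting a histogram of `errors` gives the empirical test-error distribution over interpolating classifiers for this toy task.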
In light of our findings in Chapter 3, in Chapter 4 we ask a complementary question: if modern machine learning systems perform nowhere near the worst case, how close might their performance be to the best case? For an arbitrary classification task, we can quantify the best possible error attainable by any model with the Bayes error rate, though it is generally intractable to estimate for realistic datasets. To address this intractability, we first prove that the Bayes error rate is invariant under invertible transformations of the input features. We then use normalizing flows to estimate an invertible map between the target data distribution and a simple base distribution, for which the Bayes error can be easily computed. We then evaluate a variety of state-of-the-art architectures against this estimated Bayes error rate, and find that in some (but not all) cases, these models achieve very close to optimal error rates.
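The invariance property can be checked on a minimal example: for two equally likely 1-D Gaussian classes, the Bayes-optimal rule is a threshold, and pushing the data through an invertible (here monotone) map leaves the Bayes error exactly unchanged. This toy Monte Carlo check stands in for the normalizing-flow construction described above; the map `T(x) = x**3` is an arbitrary illustrative choice.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
mu = 1.0
n = 200_000

# Two equally likely classes: N(-mu, 1) for y=0 and N(+mu, 1) for y=1.
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, mu, -mu), scale=1.0)

# Bayes-optimal rule in the original space: threshold at 0.
bayes_err = np.mean((x > 0) != (y == 1))

# Apply an invertible map T(x) = x**3; the optimal rule pushes forward
# to the same decision regions, so the Bayes error is unchanged.
z = x ** 3
bayes_err_transformed = np.mean((z > 0) != (y == 1))

# Analytic Bayes error for this problem: Phi(-mu).
analytic = (1 - erf(mu / sqrt(2))) / 2
```

Because the posterior class probabilities are invariant under invertible reparameterizations of the features (the density Jacobians cancel), the same argument holds for any invertible map, not just monotone ones in one dimension.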
In Chapter 5, we investigate average-case generalization via ensembling, a popular method wherein the predictions of multiple models are aggregated into a single, often more powerful predictor. In classical settings, the effectiveness of ensembling is well understood; however, for ensembles composed of deep neural networks, the benefits of this method are surprisingly inconsistent. To understand why, we first prove a new set of results relating the ensemble improvement rate (a measure of how much ensembling decreases error relative to the average error rate of a single model) to the ratio of model disagreement to the average error rate. This results in new oracle bounds on the error rate of ensemble classifiers, significantly improving on prior results in the literature. We then investigate ensembling experimentally and find, most notably, a distinct and consistent transition in the rate of ensemble improvement (and the disagreement-error ratio) occurring at the interpolation threshold — the point at which individual classifiers achieve exactly zero training error. Our findings suggest that ensembling is significantly less effective in the modern, overparameterized regime than it is in more classical settings.
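The two quantities at the heart of this chapter can be computed directly from model predictions. The sketch below does so for simulated binary classifiers (each independently correct 80% of the time); the exact definitions and the oracle bounds proved in the chapter may differ from this simplified rendering.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_points = 5, 1000
y = rng.integers(0, 2, size=n_points)

# Simulated predictions: each model is correct with probability 0.8,
# independently of the others.
preds = np.where(rng.random((n_models, n_points)) < 0.8, y, 1 - y)

avg_err = np.mean(preds != y)                    # average single-model error
maj = (preds.mean(axis=0) > 0.5).astype(int)     # majority-vote ensemble
ens_err = np.mean(maj != y)

# Average pairwise disagreement rate between distinct models.
pairs = [(i, j) for i in range(n_models) for j in range(i + 1, n_models)]
disagreement = np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs])

ensemble_improvement_rate = (avg_err - ens_err) / avg_err
disagreement_error_ratio = disagreement / avg_err
```

With independent errors, majority voting drives the ensemble error well below the average single-model error; the chapter's point is precisely that this independence assumption breaks down for interpolating deep networks, where disagreement collapses relative to error.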
Finally, in Chapter 6 we conclude with some reflections on the state of the field, and an outlook on how it might advance in the coming years.