This thesis is concerned with the topic of generalization in large, over-parameterized machine learning systems---that is, how models with many more parameters than the number of examples they are trained on can perform well on new, unseen data. Such systems have become ubiquitous in modern technology, achieving unprecedented success across a wide range of domains. Yet a comprehensive understanding of exactly why modern machine learning models work so well has remained elusive. Classical approaches to this question have focused on characterizing how models perform in the worst case. However, recent findings strongly suggest that such an approach is unlikely to yield the desired insight. This thesis aims to further our understanding of this phenomenon, with an eye towards moving beyond the worst case. The contents of this thesis are divided into six chapters.
In Chapter 1, we introduce the problem of generalization in machine learning, and briefly overview some of the approaches---both recent and classical---that have been taken to understand it.
Chapter 2 introduces a novel analysis of deep neural networks with positively homogeneous activation functions. We develop a method to provably sparsify and quantize the parameters of a model by sampling paths through the network. This directly leads to a covering of this class of neural networks whose size grows with a measure of complexity we call the "path norm". Using standard techniques, we derive new worst-case generalization bounds that improve on previous results in the literature.
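As a rough illustration of the kind of complexity measure involved (the precise definition used in Chapter 2 may differ in its details), a path norm for a feedforward network with edge weights $\{w_e\}$ can be written as a sum over input-to-output paths of the product of the absolute weights along each path:
\[
\|\theta\|_{\mathrm{path}} \;=\; \sum_{p \in \mathcal{P}} \prod_{e \in p} |w_e|,
\]
where $\mathcal{P}$ denotes the set of directed paths from an input unit to an output unit. It is a quantity of this form that controls the size of the covering described above.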
In Chapter 3, we take a critical look at the worst-case approach to understanding generalization. To do this, we develop a methodology for computing the full distribution of test errors of interpolating linear classifiers on real-world datasets, and we compare this distribution to the performance of the worst-case classifier on the same tasks. We consistently find that, while genuinely poor worst-case classifiers do exist for these tasks, they are exceedingly rare---so rare that we expect to essentially never encounter them in practice. Moreover, we observe that as models become larger, test errors concentrate around a critical threshold, with almost all classifiers achieving nearly the same error rate. These results suggest that the worst-case approach to generalization is unlikely to describe the practical performance of large, over-parameterized models, and that new approaches are needed.
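To fix ideas, one way to formalize this comparison (the particular sampling measure here is an illustrative assumption, not necessarily the one adopted in Chapter 3) is to consider the set of interpolating classifiers
\[
\Theta_0 \;=\; \bigl\{\theta : \widehat{\mathrm{err}}_{\mathrm{train}}(\theta) = 0\bigr\},
\]
and to contrast the distribution of the test error $\widehat{\mathrm{err}}_{\mathrm{test}}(\theta)$ under a random draw of $\theta$ from $\Theta_0$ with the worst-case value $\sup_{\theta \in \Theta_0} \widehat{\mathrm{err}}_{\mathrm{test}}(\theta)$ that uniform-convergence-style bounds must account for.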
In light of our findings in Chapter 3, in Chapter 4 we ask a complementary question: if modern machine learning systems perform nowhere near the worst case, how close might their performance be to the best case? For an arbitrary classification task, the best possible error attainable by any model is quantified by the Bayes error rate, though this quantity is generally intractable to estimate for realistic datasets. To address this intractability, we first prove that the Bayes error rate is invariant under invertible transformations of the input features. We then use normalizing flows to estimate an invertible map between the target data distribution and a simple base distribution for which the Bayes error can be easily computed. Finally, we evaluate a variety of state-of-the-art architectures against this estimated Bayes error rate, and find that in some (but not all) cases these models achieve error rates very close to optimal.
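Concretely, for a label $Y$ taking values in $\{1, \dots, K\}$ and features $X$, the Bayes error rate is
\[
\mathcal{E}^{*} \;=\; \mathbb{E}_{X}\!\left[\, 1 - \max_{k} \Pr(Y = k \mid X) \,\right],
\]
and for any invertible (measurably invertible) map $T$ the conditional probabilities satisfy $\Pr(Y = k \mid T(X)) = \Pr(Y = k \mid X)$ almost surely, so the Bayes error of $(T(X), Y)$ coincides with that of $(X, Y)$. This invariance is what allows a normalizing flow to transport the problem to a base distribution where $\mathcal{E}^{*}$ is tractable.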
In Chapter 5, we investigate average-case generalization via ensembling---a popular method wherein the predictions of multiple models are aggregated into a single, often more powerful predictor. In classical settings, the effectiveness of ensembling is well understood; for ensembles composed of deep neural networks, however, the benefits of this method are surprisingly inconsistent. To understand why, we first prove a new set of results relating the ensemble improvement rate (a measure of how much ensembling decreases error relative to the average error rate of a single model) to the ratio of model disagreement to the average error rate. These results yield new oracle bounds on the error rate of ensemble classifiers, significantly improving on prior results in the literature. We then investigate ensembling experimentally and find, most notably, a distinct and consistent transition in the rate of ensemble improvement (and in the disagreement-error ratio) occurring at the interpolation threshold---the point at which individual classifiers achieve exactly zero training error. Our findings suggest that ensembling is significantly less effective in the modern, over-parameterized regime than it is in more classical settings.
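For concreteness, writing $\bar{\epsilon}$ for the average test error of the individual models, $\epsilon_{\mathrm{ens}}$ for the test error of the ensemble, and $\bar{d}$ for the average pairwise disagreement between models, the two quantities described above can be formalized (up to details that may differ from the definitions used in Chapter 5) as
\[
\mathrm{EIR} \;=\; \frac{\bar{\epsilon} - \epsilon_{\mathrm{ens}}}{\bar{\epsilon}},
\qquad
\mathrm{DER} \;=\; \frac{\bar{d}}{\bar{\epsilon}},
\]
and the theoretical results of the chapter can be read as statements about how the improvement rate $\mathrm{EIR}$ is controlled by the disagreement-error ratio $\mathrm{DER}$.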
Finally, in Chapter 6 we conclude with some reflections on the state of the field and an outlook on how it might advance in the coming years.