 Main
When do machine learning models generalize well? A signalprocessing perspective
 Subramanian, Vignesh
 Advisor(s): Sahai, Anant
Abstract
Contemporary machine learning systems have demonstrated tremendous success at a variety of tasks including image classification, object detection and tracking, and recommendation algorithms.This success has been driven by the enormous advances in computation capabilities that enable us to utilize big training datasets, with large number classes and train models with a vast number of parameters. In fact, these systems use models that have sufficient model capacity to be trained to zero training error on noisy or even completely random labels.
However, these models often generalize well in practice and avoid harmful ``overfitting''. The key to good generalization lies in the implicit bias of the model architecture and training algorithm that steers us towards solutions that generalize well.This thesis works towards a better theoretical understanding of this phenomenon by analyzing overparameterized linear models and proving sufficient and necessary conditions for good generalization. Additionally, we also empirically investigate whether we can engineer the correct implicit bias when training models to solve practical problems in the field of control by making use of our knowledge about the problem domain.
We start by analyzing the simpler setting of overparameterized linear regression, fitting a linear model to noisy data when the number of features exceeds the number of training points. By taking a Fouriertheoretic perspective we map the key challenge posed by overparameterization to the wellknown phenomenon of aliasing of the true signal due to undersampling.Borrowing from the signalprocessing concepts of ``signal bleed’’ and ``signal contamination'', we derive conditions for good generalization for the Fourierfeature setting.
Next, we analyze the generalization error for the minimum$\ell_2$norm interpolator for the regression and binary classification problems in a Gaussianfeature setting. For regression, we interpolate realvalued labels and for binary classification, we interpolate binary labels. (It turns out that under sufficient overparameterization, minimumnorm interpolation of binary labels is equivalent to other binaryclassification training methods such as supportvector machines or gradient descent on logistic loss.)We study an asymptotic setting where the number of features $d$ scales with the number of training points $n$ and both $n,d \rightarrow \infty$. Under a bilevel spiked covariance model for the features we prove the existence of an intermediate regime where we we perform well on the classification task but not on a corresponding regression task.
We then extend the analysis to the multiclass classification setting where the number of classes also scales with the number of training points, by deriving asymptotic bounds on the classification error incurred by the minimumnorm interpolator of onehot encoded labels.
Finally, to understand how we can learn models that generalize well in practice, we empirically study the application of neural networks to learn nonlinear control strategies for hard control problems where the optimal solutions are unknown and linear solutions are provably suboptimal. By intelligently designing neural network architectures and training methods that leverage our knowledge of the dynamics of the control system, we are able to more easily and robustly learn control strategies that perform well.
Main Content
Enter the password to open this PDF file:













