When do machine learning models generalize well? A signal-processing perspective
Contemporary machine learning systems have demonstrated tremendous success at a variety of tasks including image classification, object detection and tracking, and recommendation algorithms.This success has been driven by the enormous advances in computation capabilities that enable us to utilize big training datasets, with large number classes and train models with a vast number of parameters. In fact, these systems use models that have sufficient model capacity to be trained to zero training error on noisy or even completely random labels.
However, these models often generalize well in practice and avoid harmful ``overfitting''. The key to good generalization lies in the implicit bias of the model architecture and training algorithm that steers us towards solutions that generalize well.This thesis works towards a better theoretical understanding of this phenomenon by analyzing overparameterized linear models and proving sufficient and necessary conditions for good generalization. Additionally, we also empirically investigate whether we can engineer the correct implicit bias when training models to solve practical problems in the field of control by making use of our knowledge about the problem domain.
We start by analyzing the simpler setting of overparameterized linear regression, fitting a linear model to noisy data when the number of features exceeds the number of training points. By taking a Fourier-theoretic perspective we map the key challenge posed by overparameterization to the well-known phenomenon of aliasing of the true signal due to under-sampling.Borrowing from the signal-processing concepts of ``signal bleed’’ and ``signal contamination'', we derive conditions for good generalization for the Fourier-feature setting.
Next, we analyze the generalization error for the minimum-$\ell_2$-norm interpolator for the regression and binary classification problems in a Gaussian-feature setting. For regression, we interpolate real-valued labels and for binary classification, we interpolate binary labels. (It turns out that under sufficient overparameterization, minimum-norm interpolation of binary labels is equivalent to other binary-classification training methods such as support-vector machines or gradient descent on logistic loss.)We study an asymptotic setting where the number of features $d$ scales with the number of training points $n$ and both $n,d \rightarrow \infty$. Under a bi-level spiked covariance model for the features we prove the existence of an intermediate regime where we we perform well on the classification task but not on a corresponding regression task.
We then extend the analysis to the multiclass classification setting where the number of classes also scales with the number of training points, by deriving asymptotic bounds on the classification error incurred by the minimum-norm interpolator of one-hot encoded labels.
Finally, to understand how we can learn models that generalize well in practice, we empirically study the application of neural networks to learn non-linear control strategies for hard control problems where the optimal solutions are unknown and linear solutions are provably sub-optimal. By intelligently designing neural network architectures and training methods that leverage our knowledge of the dynamics of the control system, we are able to more easily and robustly learn control strategies that perform well.