Understanding the Role of Optimization and Loss Function in Double Descent
- Liu, Chris (Yuhao)
- Advisor(s): Flanigan, Jeffrey J
Abstract
Double descent has emerged as a fascinating phenomenon that has been observed across a range of tasks, model architectures, and training paradigms. When double descent occurs, the generalization error does not decrease monotonically: it initially decreases, then increases as the model enters the critically parameterized regime, and finally decreases again. Despite the ubiquity of the phenomenon, simple explicit regularization techniques such as weight decay and early stopping have been successful in mitigating double descent in both theoretical and practical settings. However, we observe that, in realistic settings, double descent is reduced or absent even without any form of explicit regularization. This observation raises a key question: if overfit models do not exhibit double descent in practice, why not?
We identify two key reasons: 1) the use of poor optimizers that struggle to reach a low-loss local minimum even though they achieve zero training error, and 2) the presence of an exponential tail in the loss function. We further show that, given a sufficient number of iterations, poor optimizers can begin to recover the peak. Exponential-tail loss functions, however, remain far more resistant to the peaking behavior even in the long run, when models are extremely overfit. Additionally, we show that loss functions that suffer from double descent (e.g., MSE loss) can be made to exhibit monotonic behavior, that is, no peak, when they are modified to have an exponential tail.
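As a concrete illustration of the kind of modification described above, the sketch below grafts an exponential tail onto the squared loss for a classification margin m = y·f(x). The particular construction (a quadratic branch that hands off, with matched value and slope, to an exponentially decaying branch at a chosen junction point) and the names squared_loss, exp_tail_squared_loss, and tail_start are assumptions made for illustration, not the modification used in this work.

```python
import numpy as np

def squared_loss(margin):
    """Standard squared loss on the classification margin m = y * f(x)."""
    return (1.0 - margin) ** 2

def exp_tail_squared_loss(margin, tail_start=0.0):
    """Hypothetical squared loss with an exponential tail (illustration only).

    For margins below `tail_start` this is the usual quadratic (1 - m)^2.
    Beyond `tail_start` (which must be < 1 so the quadratic still has negative
    slope there), the quadratic branch is replaced by an exponential that
    matches it in value and first derivative at the junction.
    """
    m = np.asarray(margin, dtype=float)
    v = (1.0 - tail_start) ** 2        # loss value at the junction
    a = 2.0 / (1.0 - tail_start)       # decay rate matching the slope at the junction
    quad = (1.0 - m) ** 2
    tail = v * np.exp(-a * (m - tail_start))
    return np.where(m <= tail_start, quad, tail)

if __name__ == "__main__":
    margins = np.linspace(-2.0, 4.0, 7)
    print("margin     :", np.round(margins, 2))
    print("squared    :", np.round(squared_loss(margins), 3))
    print("exp-tailed :", np.round(exp_tail_squared_loss(margins), 3))
```

Because the two branches agree in value and slope at the junction, only the tail behavior changes: confidently fit examples see an exponentially vanishing loss rather than being pulled back toward margin 1, which is the property the abstract associates with resistance to the peak.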
To validate our findings, we conduct experiments on a wide range of regression and classification loss functions using random feature models and two-layer neural networks trained on realistic datasets. Our results confirm the influence of the two factors identified above on the peaking behavior. These findings offer new insights into double descent, a phenomenon central to understanding generalization in machine learning.
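To make the experimental setup concrete, here is a minimal sketch of a random feature experiment of the kind described above, using a minimum-norm least-squares readout on ReLU random features. The synthetic regression task, the random seed, and the helper names make_data, random_features, and test_mse are assumptions for illustration; the dissertation's experiments use realistic datasets, a broader set of loss functions, and two-layer networks as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression task (a stand-in for the realistic datasets in the text).
d, n_train, n_test = 20, 100, 2000
w_star = rng.normal(size=d) / np.sqrt(d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.1 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def random_features(X, W):
    """ReLU random features phi(x) = max(Wx, 0) with frozen first-layer weights W."""
    return np.maximum(X @ W.T, 0.0)

def test_mse(n_features):
    """Fit an unregularized, minimum-norm least-squares readout on top of random features."""
    W = rng.normal(size=(n_features, d)) / np.sqrt(d)
    Phi_tr = random_features(X_tr, W)
    Phi_te = random_features(X_te, W)
    beta = np.linalg.pinv(Phi_tr) @ y_tr   # minimum-norm interpolating solution
    return np.mean((Phi_te @ beta - y_te) ** 2)

# Sweep the width through the interpolation threshold (n_features ~ n_train).
for p in [10, 25, 50, 75, 90, 100, 110, 150, 300, 1000]:
    err = np.mean([test_mse(p) for _ in range(5)])   # average over feature draws
    print(f"features={p:5d}  test MSE={err:.3f}")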