Surprising Empirical Phenomena of Deep Learning and Kernel Machines
- Hui, Like
- Advisor(s): Belkin, Mikhail
Abstract
Over the past decade, the field of machine learning has witnessed significant advances in artificial intelligence, primarily driven by empirical research. Within this context, we present several surprising empirical phenomena observed in deep learning and kernel machines. Among the crucial components of a learning system, the training objective holds immense importance. For classification tasks, the cross-entropy loss has emerged as the dominant choice for training modern neural architectures and is widely believed to be empirically superior to the square loss. However, there is little compelling empirical or theoretical evidence establishing a clear-cut advantage for the cross-entropy loss. In fact, our findings demonstrate that training with the square loss achieves comparable or even better results than the cross-entropy loss, even when computational resources are equalized.
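As a concrete illustration, the sketch below shows the two objectives side by side in PyTorch; the model, data, and hyperparameters are illustrative placeholders rather than the architectures and settings studied in the dissertation.

```python
# Minimal sketch (PyTorch): the two training objectives compared above.
# Model, data, and hyperparameters are placeholders for illustration only.
import torch
import torch.nn.functional as F

def cross_entropy_loss(logits, targets):
    # Standard cross-entropy on integer class labels.
    return F.cross_entropy(logits, targets)

def square_loss(logits, targets, num_classes):
    # Square (MSE) loss of the raw outputs against one-hot labels.
    one_hot = F.one_hot(targets, num_classes).float()
    return F.mse_loss(logits, one_hot)

# Example: a single training step with either objective.
model = torch.nn.Linear(20, 10)            # stand-in for any architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 20)                    # a batch of 32 inputs
y = torch.randint(0, 10, (32,))            # integer labels for 10 classes

optimizer.zero_grad()
logits = model(x)
loss = square_loss(logits, y, num_classes=10)   # or cross_entropy_loss(logits, y)
loss.backward()
optimizer.step()
```

With the square loss, the network outputs are regressed directly onto one-hot labels, so no softmax is applied before the loss.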
Training with the square loss, however, often involves a rescaling controlled by a hyperparameter R, and it remains unclear how R should vary with the number of classes. We provide an exact analysis for a one-layer ReLU network in the proportional asymptotic regime for isotropic Gaussian data. Specifically, we characterize the optimal choice of R as a function of (i) the number of classes, (ii) the degree of overparameterization, and (iii) the level of label noise. Finally, we provide empirical results on real data that support our theoretical predictions.
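The exact parameterization of the rescaling is not spelled out in this abstract; a common form, assumed in the sketch below, simply scales the one-hot target of the true class by R, so that R = 1 recovers the plain square loss above.

```python
# Hedged sketch of a rescaled square loss with hyperparameter R.
# Assumption: R scales the one-hot target of the true class; other
# parameterizations (e.g., also reweighting the true-class term) exist.
import torch.nn.functional as F

def rescaled_square_loss(logits, targets, num_classes, R=1.0):
    # Targets are R on the true class and 0 on the incorrect classes.
    scaled_targets = R * F.one_hot(targets, num_classes).float()
    return F.mse_loss(logits, scaled_targets)
```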
To avoid the extra hyperparameters introduced by rescaling the square loss (needed when the number of classes is large), we then propose the “squentropy” loss, defined as the sum of the cross-entropy loss and the average square loss over the incorrect classes. We show that squentropy outperforms both the pure cross-entropy and the rescaled square loss in terms of classification accuracy and model calibration. Moreover, squentropy is a simple “plug-and-play” replacement for cross-entropy: it requires no extra hyperparameters and no additional tuning of optimization parameters.
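Following the definition above (cross-entropy plus the average square loss over the incorrect classes), a minimal PyTorch sketch of squentropy might look as follows; the reduction over the batch is my assumption.

```python
# Sketch of squentropy: cross-entropy plus the average squared logit
# over the C - 1 incorrect classes, averaged over the batch.
import torch.nn.functional as F

def squentropy_loss(logits, targets):
    num_classes = logits.shape[1]
    ce = F.cross_entropy(logits, targets)               # standard cross-entropy term
    one_hot = F.one_hot(targets, num_classes).float()
    # Square loss averaged over the incorrect classes only.
    sq = ((1.0 - one_hot) * logits.pow(2)).sum(dim=1) / (num_classes - 1)
    return ce + sq.mean()
```

Because it takes the same (logits, integer targets) arguments as F.cross_entropy and introduces no new hyperparameters, it can be dropped into an existing training loop in place of the cross-entropy call.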
We also apply theoretically well-understood kernel machines to a challenging practical task, speech enhancement, and find that they outperform fully connected networks while requiring fewer computational resources. In another line of work, we investigate the relationship between the Neural Collapse phenomenon proposed by Papyan, Han, & Donoho (2020) and generalization in deep learning. We give precise definitions that clarify the neural collapse concepts and examine what each implies for generalization. Moreover, our empirical evidence supports the claim that neural collapse is primarily an optimization phenomenon.