This thesis contributes to the emerging fields of nonlinear random matrix theory and deep learning theory. The main contributions are summarized as follows:
In the linear-width regime, where the network widths scale proportionally with the sample size, we prove global laws for the spectra of the neural tangent kernel (NTK) and conjugate kernel (CK) matrices across layers. For datasets with low-dimensional signal structure, we characterize the outlier eigenvalues and eigenvector alignments of the CK matrices, extending recent results on spiked covariance models.

In the ultra-wide regime, where the width of the first layer is much larger than the sample size, we show that the spectra of both the CK and NTK matrices converge to a deformed semicircle law.
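For concreteness, one standard formulation of these kernel matrices for a one-hidden-layer network is the following; the notation here is illustrative and may differ from that used in the body of the thesis. With data $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, first-layer weights $W \in \mathbb{R}^{d_1 \times d}$, and an entrywise activation $\sigma$, the conjugate kernel and the NTK Gram matrix of a network $f_\theta$ can be written as
\[
  K^{\mathrm{CK}} \;=\; \frac{1}{d_1}\,\sigma(WX)^\top \sigma(WX),
  \qquad
  K^{\mathrm{NTK}}_{ij} \;=\; \big\langle \nabla_\theta f_\theta(x_i),\, \nabla_\theta f_\theta(x_j) \big\rangle,
\]
so that the linear-width regime corresponds to $n, d, d_1 \to \infty$ with $d/n$ and $d_1/n$ converging to positive constants, while the ultra-wide regime takes $d_1/n \to \infty$.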
Going beyond random initialization, we investigate the spectral properties of trained weight, CK, and NTK matrices through empirical and theoretical analyses. Empirically, we demonstrate the invariance of the bulk spectra under small learning rates, the emergence of outlier eigenvalues under large learning rates, and the appearance of heavy-tailed spectral distributions after adaptive gradient training, and we correlate these phenomena with feature learning.

Theoretically, we prove the invariance of the bulk spectra under small constant learning rates and characterize the feature learning phenomenon: gradient descent on the first-layer weights induces a rank-one spiked structure in the weight and CK matrices, whose spike eigenvectors align with the test labels.
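Schematically, and with notation assumed here only for illustration, the trained first-layer weight matrix in this feature learning regime admits an approximate decomposition
\[
  W_t \;\approx\; W_0 \;+\; c_t\, u v^\top,
\]
where $W_0$ denotes the weights at initialization and $c_t\, u v^\top$ is a rank-one spike whose direction correlates with the signal in the data; the corresponding spike eigenvector of the trained CK matrix is the one that aligns with the test labels.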