Neural networks are rapidly increasing in size, making overparameterization a common feature of modern deep learning. This presents challenges for both the theory and the application of deep learning. From a theoretical standpoint, it remains an open question why neural networks generalize well despite overparameterization. From an application perspective, overparameterization incurs significant computation and storage costs, which limits the practical deployment of deep neural networks. This thesis presents our attempts to address both issues. On the application side, we propose training a low-rank tensorized neural network to compress the model and reduce the computation cost during both training and inference. We also apply Bayesian methods to quantify the uncertainty of this model. On the theory side, we apply a recently developed tool, the neural tangent kernel (NTK), to study the training dynamics of infinitely wide neural networks. We compare the eigenvalue decay of the NTK of a vanilla neural network with that of a binary weight neural network, and show that the eigenvalues of the latter decay faster, which explains why binary weight neural networks empirically exhibit a smaller generalization gap. We also examine the effect of weight decay and demonstrate that it induces sparsity in both a parallel neural network and a ResNet, thereby proving that neural networks are locally adaptive, a property not shared by any linear method, including kernel methods.
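For reference, the NTK of a network $f(x;\theta)$ with parameters $\theta$ is the kernel formed by inner products of parameter gradients; the symbols below are generic placeholders rather than notation taken from a specific chapter:
\[
\Theta(x, x') \;=\; \big\langle \nabla_\theta f(x;\theta),\; \nabla_\theta f(x';\theta) \big\rangle .
\]
In the infinite-width limit this kernel remains essentially fixed during training, so gradient descent on the network behaves like kernel regression with $\Theta$; the decay rate of the eigenvalues of $\Theta$ then governs how quickly different components of the target function are fit, which is why comparing eigenvalue decay across architectures is informative about generalization.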
For the problems discussed above, we present both theoretical analyses of our methods and claims, and numerical experiments that validate our conclusions.