Efficient Second-Order Methods for Non-Convex Optimization and Machine Learning
Hessian-based analysis and computation are widely used in scientific computing. However, due to the (incorrect, but in our experience widespread) belief that Hessian-based computations are infeasible for large machine learning (ML) problems, the majority of work in ML (except on quite small problems) uses only first-order methods. In fact, using sub-sampling and randomized numerical linear algebra algorithms, second-order information can be computed efficiently for large-scale machine learning problems. In this thesis, we consider three use cases of second-order methods: (i) For non-convex optimization and ML problems, we propose inexact variants of three classic Newton-type methods---the Trust Region, Cubic Regularization, and Newton-CG methods---as well as a novel adaptive second-order method, AdaHessian. For the classic methods, we prove that, under relaxed gradient and Hessian approximation conditions, our variants achieve the same iteration complexity as their full-gradient and full-Hessian counterparts, and we empirically demonstrate the efficiency and robustness of these Newton-type algorithms. For the novel AdaHessian algorithm, we incorporate a moving average of the Hessian information and illustrate its superb performance on large-scale deep learning problems compared to first-order optimizers; (ii) For machine learning and data science analysis, we develop a Hessian-based computation framework.
The framework enables fast computation of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it can be used to analyze neural networks (NNs), including the topology of the loss landscape (i.e., curvature information), to gain insight into the behavior of different models and optimizers; (iii) For applications, we first show that a Hessian-based metric can serve as a sensitivity measure for determining the robustness of each layer of a NN, which in turn guides mixed-precision quantization. Second, we demonstrate that second-order adversarial attacks achieve smaller perturbations than first-order methods, making them a powerful attack for verifying the robustness of a NN to adversarial attacks.
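To make the matrix-free computations underlying such a framework concrete, the sketch below estimates the Hessian trace with Hutchinson's method and the top Hessian eigenvalue with power iteration, touching the matrix only through Hessian-vector products. This is a minimal illustration, not the thesis framework itself: a small explicit symmetric matrix with a planted dominant eigenvalue stands in for the Hessian, whereas in practice the products would be computed by backpropagating through a network's gradient, and all names and parameters here are illustrative.

```python
import numpy as np

# Stand-in "Hessian": a rotated diagonal matrix with known spectrum.
# True trace = sum(eigs) = 9, dominant eigenvalue = 10.
rng = np.random.default_rng(0)
n = 100
eigs = np.linspace(-1.0, 1.0, n)
eigs[-1] = 10.0                          # plant one dominant eigenvalue
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(eigs) @ Q.T

def hvp(v):
    """Hessian-vector product oracle: the only access to H we allow.
    For a neural network this would be computed by backprop through the
    gradient (Pearlmutter's trick), never by forming H explicitly."""
    return H @ v

def hutchinson_trace(hvp, n, num_samples=1000, rng=rng):
    """Estimate tr(H) = E[z^T H z] with Rademacher probe vectors z."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.integers(0, 2, size=n) * 2.0 - 1.0   # entries in {-1, +1}
        total += z @ hvp(z)
    return total / num_samples

def power_iteration(hvp, n, num_iters=100, rng=rng):
    """Estimate the dominant Hessian eigenvalue from matvecs alone."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        w = hvp(v)
        lam = v @ w                      # Rayleigh quotient
        v = w / np.linalg.norm(w)
    return lam

trace_est = hutchinson_trace(hvp, n)
top_eig = power_iteration(hvp, n)
print(trace_est, top_eig)  # stochastic trace estimate near 9; eigenvalue near 10
```

Both routines cost one Hessian-vector product per probe or iteration, which is why such curvature statistics remain tractable even when the Hessian itself is far too large to form.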