Why do Gradient Methods Work in Optimization and Sampling?
Abstract
Modern machine learning models are complex, hierarchical, and large-scale, and they are trained by minimizing non-convex objective functions. The algorithms used to train them, however, are simple, incremental, first-order gradient methods such as gradient descent and Langevin Monte Carlo. Why and when do these seemingly simple algorithms succeed?
This question is the focus of this thesis, which considers three problems. The first concerns training deep neural network classifiers with gradient descent on the logistic loss. We establish conditions under which gradient descent drives the logistic loss to zero and prove bounds on the rate of convergence. Our analysis applies to smoothed approximations of the ReLU activation, such as the Swish and the Huberized ReLU, which have been proposed in earlier applied work. We give two sufficient conditions for convergence: the first is simply a bound on the loss at initialization; the second is a data separation condition used in prior analyses.
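To make the first setting concrete, the sketch below runs plain gradient descent on the logistic loss of a small two-layer network with the Swish activation. It is a minimal illustration only: the architecture, initialization scale, step size, and synthetic data are all assumptions for the example, not the exact setup analyzed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(z):
    # Swish: a smooth approximation to the ReLU, z * sigmoid(z).
    return z / (1.0 + np.exp(-z))

def swish_grad(z):
    # Derivative of Swish: sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z)).
    s = 1.0 / (1.0 + np.exp(-z))
    return s + z * s * (1.0 - s)

# Synthetic binary classification data with labels in {-1, +1}.
n, d, width = 200, 10, 64
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

# Two-layer network: f(x) = v^T swish(W x).
W = rng.standard_normal((width, d)) / np.sqrt(d)
v = rng.standard_normal(width) / np.sqrt(width)

step = 0.1
for t in range(500):
    Z = X @ W.T                  # pre-activations, shape (n, width)
    H = swish(Z)                 # hidden layer
    f = H @ v                    # network outputs
    margins = y * f
    # Logistic loss: mean of log(1 + exp(-y * f(x))).
    loss = np.mean(np.log1p(np.exp(-margins)))
    # d(loss)/d(f_i) = -y_i * sigmoid(-y_i * f_i) / n
    g = -y / (1.0 + np.exp(margins)) / n
    grad_v = H.T @ g
    grad_W = ((g[:, None] * v[None, :]) * swish_grad(Z)).T @ X
    v -= step * grad_v
    W -= step * grad_W

print(f"final logistic loss: {loss:.4f}")
```

Because the Swish is smooth, the loss is differentiable everywhere and the gradient computed above is exact, which is the property that the convergence analysis exploits.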
The second problem pertains to sampling from a strongly log-concave density. We prove an information-theoretic lower bound on the number of stochastic gradient queries of the log-density needed to generate a sample. Many popular sampling algorithms (including several Markov chain Monte Carlo methods) operate by querying stochastic gradients of the log-density; our result establishes an information-theoretic limit for all such algorithms.
The final problem involves sampling from a distribution whose log-density is non-smooth. We show that a slight modification of the Langevin Monte Carlo algorithm generates samples from such distributions in polynomial time, and we provide non-asymptotic guarantees on its rate of convergence.
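For reference, here is a minimal sketch of the unadjusted Langevin Monte Carlo iteration for a smooth potential; the specific modification for non-smooth log-densities developed in the thesis is not reproduced here. The target (a standard Gaussian), step size, and horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_log_density(x):
    # Illustrative target: standard Gaussian, log p(x) = -||x||^2 / 2 + const,
    # so grad log p(x) = -x. Any smooth log-density gradient can be swapped in.
    return -x

d = 2
step = 0.01
x = rng.standard_normal(d)   # arbitrary initialization

samples = []
for t in range(10_000):
    noise = rng.standard_normal(d)
    # Langevin Monte Carlo update:
    # x_{t+1} = x_t + step * grad log p(x_t) + sqrt(2 * step) * noise
    x = x + step * grad_log_density(x) + np.sqrt(2.0 * step) * noise
    samples.append(x.copy())

samples = np.array(samples)
print("empirical mean:", samples[-5000:].mean(axis=0))
print("empirical variance (diagonal):", samples[-5000:].var(axis=0))
```

Note that the update uses only gradients of the log-density, which is why gradient-query lower bounds of the kind described above apply directly to this family of algorithms.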