Optimization is a natural language in which to express a multitude of problems from all reaches of the mathematical sciences. This dissertation is in three parts, each detailing work on a different problem in optimization.
We begin in stochastic thermodynamics, with the problem of computing optimal operating protocols for Brownian machines by minimizing their average dissipation. We develop a perturbative solution to the Fokker-Planck equation for one-dimensional driven Brownian motion in the limit of slow driving. Calculating the average dissipation in this formalism, we demonstrate the existence of an emergent Riemannian geometric structure on the space of control parameters, whereby the $n^{\mathrm{th}}$-order term in the average dissipation contains a tensor of order $n$. We rigorously derive an exact formula for the order-$2$ tensor (the metric) and show that, up to second order in the perturbation theory, dissipation-minimizing driving protocols are precisely those that minimize the length defined by this metric.
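For orientation, the generic shape of such a geometric statement is as follows (the notation here is illustrative; the dissertation fixes its own conventions). For a control protocol $\lambda(t)$, $t \in [0,\tau]$, driven slowly, the second-order contribution to the average excess dissipation is a quadratic form in the control velocities,
\[
\langle W_{\mathrm{ex}} \rangle \;=\; \int_0^\tau g_{ij}\bigl(\lambda(t)\bigr)\,\dot{\lambda}^i \dot{\lambda}^j \, dt \;+\; O(\tau^{-2}),
\]
and by the Cauchy--Schwarz inequality this functional, at fixed endpoints and duration, is bounded below by $\mathcal{L}^2/\tau$, where $\mathcal{L} = \int_0^\tau \sqrt{g_{ij}\dot{\lambda}^i \dot{\lambda}^j}\,dt$ is the Riemannian length of the path. The bound is attained by protocols that traverse the minimizing geodesic at constant metric speed, which is the sense in which dissipation-minimizing protocols are length-minimizing.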
Next, we move to machine learning, and study the generalization properties of models trained with first-order optimizers on whitened data, and of models trained with exact second-order optimizers. For a general class of models trained with first-order optimizers, we find that the mutual information between the trained model and the dataset is strictly smaller when the data is whitened than when it is not. This loss of access to information relevant for test performance degrades generalization in a dimension-dependent manner, worsening for high-dimensional problems. Exploiting the analogy, for linear models and wide neural networks, between training with a first-order optimizer on whitened data and training with an exact second-order optimizer, we also establish a generalization gap between first- and exact second-order optimizers in these model classes. Experiments demonstrate that our conclusions hold in a larger class of models than the theory covers, but also that appropriately regularizing a second-order optimizer can compensate for the loss of generalization performance observed in the exact setting.
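The analogy invoked here can be checked directly in its simplest instance, linear least squares, where gradient descent on whitened inputs reproduces, iterate for iterate, a damped Newton method on the raw inputs. The following numerical sketch is our own minimal construction (the toy problem and variable names are illustrative, not code from the dissertation):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # ill-conditioned design
y = rng.normal(size=n)

H = X.T @ X / n                             # Hessian of 0.5/n * ||Xw - y||^2
evals, evecs = np.linalg.eigh(H)
W = evecs @ np.diag(evals**-0.5) @ evecs.T  # H^{-1/2}, the whitening map
Xw = X @ W                                  # whitened design: Hessian = identity

eta, T = 0.5, 50
v = np.zeros(d)                             # GD iterate, whitened coordinates
w = np.zeros(d)                             # Newton iterate, raw coordinates
for _ in range(T):
    # first-order step on whitened data
    v -= eta * Xw.T @ (Xw @ v - y) / n
    # exact second-order (damped Newton) step on raw data
    w -= eta * np.linalg.solve(H, X.T @ (X @ w - y) / n)

print(np.allclose(W @ v, w))  # True: trajectories coincide in raw coordinates
\end{verbatim}
Mapping the whitened iterates back through $w = Wv$ turns each gradient step into $w \leftarrow w - \eta H^{-1}\nabla L(w)$, which is exactly the damped Newton update.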
Finally, we study the problem of setting step sizes for discrete-time gradient-based optimization. Taking the view that a discrete-time optimizer is a simulation of an underlying ordinary differential equation (ODE), we set step sizes for the former by controlling the error with which it tracks the latter. The resulting adaptive step size mechanism is highly efficient, requiring just one extra gradient call over all iterations, and it applies to any optimizer with a well-defined continuous-time representation as an ODE. We find empirically that this step size routine achieves oracle lower bounds on strongly convex functions for first- and second-order optimizers, and that it also performs competitively with the state of the art on a nonconvex problem, principal component analysis.
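To make the "track the ODE" viewpoint concrete, the sketch below applies a textbook step-doubling error controller to the Euler discretization of gradient flow $\dot{\theta} = -\nabla f(\theta)$. We stress that this generic controller spends extra gradient evaluations on every step, unlike the dissertation's mechanism, which needs only one extra gradient call in total; the sketch (our own, on an assumed quadratic toy objective) conveys only the underlying idea:
\begin{verbatim}
import numpy as np

A = np.diag([1.0, 10.0, 100.0])   # toy strongly convex f(theta) = 0.5 theta^T A theta
def grad(theta):
    return A @ theta

theta = np.ones(3)
h, tol = 0.01, 1e-4

for _ in range(200):
    g = grad(theta)
    full = theta - h * g                      # one Euler step of size h
    half = theta - 0.5 * h * g
    two_halves = half - 0.5 * h * grad(half)  # two Euler steps of size h/2
    err = np.linalg.norm(full - two_halves)   # local ODE-tracking error estimate
    if err < tol:
        theta = two_halves                    # accept the more accurate iterate
    # standard controller for an order-1 method: exponent 1/(p+1) = 1/2
    h *= 0.9 * (tol / (err + 1e-16)) ** 0.5
\end{verbatim}
The step size thus grows or shrinks so that each discrete step stays within a tolerance of the continuous gradient-flow trajectory, which is the error criterion the abstract describes.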