AI applications have seen great advances in recent years, so much so that the people who design and build them often have difficulty understanding, predicting and controlling what they do. To make progress on these fundamental challenges, I believe that developing a solid mathematical foundation for AI is both beneficial and possible. The research presented in this dissertation is an attempt to chip away at a few small aspects of that endeavor.
The dissertation is divided into three parts.
Part I addresses a question at the core of learning theory: "how much data is necessary for supervised learning?" Concretely, Chapter 2 considers statistical settings, in which training data is drawn from an unknown distribution. Here, we answer a question posed by Antos and Lugosi (1996) concerning the shape of learning curves. To do so, we define a new combinatorial quantity, which we call the Vapnik--Chervonenkis--Littlestone dimension, and show that it characterizes the rate at which the learner's error decays. Our result has a number of additional benefits: it refines the trichotomy theorem of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021); qualitatively strengthens classic 'no free lunch' lower bounds; and establishes that, in the distribution-dependent setting, semi-supervised learning is no easier than supervised learning. Chapter 3 considers adversarial settings, in which few or no assumptions are made regarding the source of the training data. Specifically, we chart the landscape of transductive online learning, showing how it compares to the standard setting in online learning, and how it relates to combinatorial quantities such as the VC and Littlestone dimensions.
Part II investigates whether it is possible to verify the optimality of a machine learning outcome offered by an untrusted party, such that verification would be significantly cheaper (in terms of compute, or the quantity or quality of training data) compared to the cost of running a trusted machine learning system. This question has many parallels in the theory of computation, and it also has tangible implications to the economics of selling machine learning as a service. Our first contribution is to introduce a notion of interactive proofs for verifying machine learning. Here, the entity running the learning algorithm proves to the verifier that a proposed hypothesis is competitive with some benchmark, for instance, is sufficiently close to the best hypothesis in a reference class of hypotheses. Our primary focus is verifying agnostic supervised machine learning. Within this framework, we show a host of verification protocols and lower bounds, establishing that in some cases, verification can be significantly cheaper than learning, while in other cases it cannot. In particular, our results include: (1) for supervised learning, the sample complexity gap between learning and verifying is quadratic in some (natural) cases, and furthermore it can never be greater than quadratic; (2) whereas learning the class of Fourier-sparse boolean functions using i.i.d. samples is LPN-hard, we show that there exists an efficient protocol for verifying this class, wherein the verifier only uses i.i.d. samples.
Part III studies notions of stability in machine learning. We offer a taxonomy that can help make sense of an assortment of seemingly-unrelated stability definitions that have appeared in the learning theory literature. Our starting point is an observation that many of these definitions actually follow a similar abstract formulation. We call this the Bayesian formulation of stability, and we ask, to what extent are the various Bayesian definitions in the literature actually different from one another. To answer this question, we distinguish between two variants: distribution-dependent Bayesian stability, and distribution-independent Bayesian stability. Putting together results from a number of recent publications shows that many distribution-dependent Bayesian definitions, including approximate differential privacy, are in fact weakly equivalent to each other. To complete the picture, we investigate the family of distribution-independent Bayesian definitions. We show that here too, many definitions, including pure differential privacy, are weakly equivalent to each other. Our proof involves developing a boosting algorithm that simultaneously improves the accuracy and the stability of a learning algorithm.
The dissertation consists of five chapters, each of which is self-contained and can be read independently. However, a small number of basic notions crop up repeatedly in different guises. Taken together, the dissertation showcases the richness, versatility and unity of learning theory.