Distribution shift in machine learning refers to the general problem in which a model is evaluated on test data drawn from a different distribution than the training data. In real-world applications, distribution shift is often not the exception but the norm: for example, the data distribution may change over time, the model may be tested in new and unforeseen circumstances, or the shift may even be a consequence of the problem definition itself. To realize the full potential of machine learning, therefore, effective solutions must be developed for dealing with distribution shift.
In this thesis, we first review the many ways that distribution shift arises in machine learning settings, using several examples to ground the problem while demonstrating how ubiquitous it is. The first few of these examples come from control and reinforcement learning, where distribution shift is baked into the problem formulation itself. Because these shifts are easily characterized and peripheral to the main goal, handcrafted techniques can be brought to bear to handle them. The subsequent examples illustrate ways in which shift creeps into real-world supervised learning problems, which motivates the study and development of general-purpose learning paradigms that can be used to tackle these and future examples.
The paradigms we focus on in this thesis revolve around the concept of adaptation: leveraging the information available at test time in order to change the model so that it better handles the test data. We ask an open-ended question: how can and should a model adapt when faced with distribution shift? Our first proposal for answering this question is the paradigm of adaptive risk minimization: when provided at training time with examples of the shifts that are likely to occur, the model should learn to adapt to these training shifts, thus better preparing itself for similar shifts at test time. We formalize and instantiate methods for this paradigm through the toolkit of meta-learning, demonstrating that these methods are competitive with, and oftentimes superior to, prior approaches for handling distribution shift. Our second proposal for answering the above question lies at the other end of the spectrum: even without access to the training procedure or to multiple test points, the model can still rely on the inductive biases conveyed by data augmentations in order to adapt. This leads to the method of marginal entropy minimization with one test point, a broadly applicable "slot-in replacement" for standard inference that proves effective for a wide range of commonly used models on a number of challenging distribution shift benchmarks.
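To make the second paradigm concrete, the sketch below shows one way such a test-time procedure might look: the model is adapted to a single test point by minimizing the entropy of its marginal prediction averaged over augmented copies of that point, and standard inference is then run with the adapted parameters. This is a minimal sketch under assumed details, not the exact implementation from the thesis; the PyTorch classifier `model` (mapping a batch of inputs to logits), the stochastic `augment` function, and the hyperparameter values are all illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def adapt_and_predict(model, x, augment, n_aug=32, lr=1e-3, steps=1):
    """Sketch of marginal entropy minimization with one test point.

    Assumptions: `model` maps a batch of inputs to logits, and `augment`
    returns a randomly augmented copy of the single (unbatched) input `x`.
    In practice, a copy of the model would be adapted per test point so
    that the original parameters are not overwritten.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        # Build a batch of augmented copies of the single test point.
        x_aug = torch.stack([augment(x) for _ in range(n_aug)])
        probs = F.softmax(model(x_aug), dim=-1)
        # Marginal predictive distribution, averaged over augmentations.
        marginal = probs.mean(dim=0)
        entropy = -(marginal * marginal.clamp_min(1e-12).log()).sum()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    # Standard inference on the original test point with the adapted model.
    model.eval()
    with torch.no_grad():
        return model(x.unsqueeze(0)).argmax(dim=-1)
```

Because the procedure touches only the forward pass and a few gradient steps at test time, it can be slotted in front of any differentiable classifier without modifying how that classifier was trained.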