The field of artificial intelligence has seen rapid progress in recent years, and that progress has raised serious concerns, ranging from the immediate harms caused by systems that replicate harmful biases to the more distant worry that effective goal-directed systems may, at a certain level of performance, be able to subvert meaningful control efforts. In this dissertation, I argue the following thesis:

1. The use of incomplete or incorrect incentives to specify the target behavior for an autonomous system creates a value alignment problem between the principal(s), on whose behalf a system acts, and the system itself;

2. This value alignment problem can be approached in theory and practice through the development of systems that are responsive to uncertainty about the principal’s true, unobserved, intended goal; and

3. Value alignment problems can be modeled as a class of cooperative assistance games, which are computationally similar to the class of partially observable Markov decision processes (POMDPs). This model captures the principal’s capacity to behave strategically in coordination with the autonomous system. It leads to solutions to alignment problems that are distinct from those produced by more traditional approaches to preference learning, such as inverse reinforcement learning, and it demonstrates the need for strategically robust alignment solutions.
Chapter 2 reviews the background needed for this work. Chapter 3 argues the first part of the thesis. First, in Section 3.1 we consider an order-following problem between a robot and a human. We show that improving on the human player’s performance requires that the robot deviate from the human’s orders. However, if the robot has an incomplete preference model (i.e., it fails to model properties of the world that the person cares about), then there is persistent misalignment in the sense that the robot takes suboptimal actions with positive probability indefinitely. Then, in Section 3.2, we consider the problem of optimizing an incomplete proxy metric and show that this phenomenon is a consequence of incompleteness and shared resources. That is, we provide general conditions under which optimizing any fixed incomplete representation of preferences will lead to arbitrarily large losses of utility for the human player. We identify dynamic incentive protocols and impact minimization as theoretical solutions to this problem.
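To illustrate the shared-resources mechanism behind Section 3.2, consider the following toy model (my own sketch; the log-shaped utilities, the budget, and the weight vectors are illustrative assumptions, not the chapter's construction). When every attribute draws on one fixed budget, optimizing a proxy that omits attributes the human cares about starves those attributes, and the human's true utility ends up well below what an aligned optimizer achieves.

```python
import numpy as np

def utility(x, weights):
    """Diminishing-returns utility: sum_i w_i * log(1 + x_i)."""
    return float(np.sum(weights * np.log1p(x)))

def optimize(weights, budget, iters=5000, lr=0.05):
    """Gradient ascent with renormalization so that x >= 0 and sum(x) = budget."""
    x = np.full(len(weights), budget / len(weights))
    for _ in range(iters):
        grad = weights / (1.0 + x)
        x = np.clip(x + lr * grad, 0.0, None)
        x *= budget / x.sum()          # all attributes compete for a shared budget
    return x

budget = 10.0
true_w = np.array([1.0, 1.0, 1.0, 1.0])    # the human cares about all four attributes
proxy_w = np.array([1.0, 1.0, 0.0, 0.0])   # the proxy omits the last two

x_aligned = optimize(true_w, budget)
x_proxy = optimize(proxy_w, budget)

print("true utility when optimizing the true objective:", round(utility(x_aligned, true_w), 2))
print("true utility when optimizing the proxy         :", round(utility(x_proxy, true_w), 2))
```

Running the sketch shows the proxy optimizer pushing the omitted attributes toward zero, which is the kind of utility loss that the chapter's general conditions formalize.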
Next, Chapter 4 addresses the second part of the thesis. We first show, in Section 4.1, that uncertainty about utility evaluations creates an incentive for the system to seek supervision from the human player. Then, in Section 4.2 and Section 4.3, we demonstrate how to use uncertainty about utility evaluations to implement reward learning approaches that penalize negative side effects and support dynamic incentive protocols. Specifically, we show how to apply Bayesian inference to learn a distribution over potential true utility functions, given the observation of a proxy in a specific development context.
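As a concrete illustration of this Bayesian inference step, the sketch below is a minimal toy of my own, assuming a small discrete hypothesis space of candidate true weight vectors, a fixed set of development-time behaviors, and a Boltzmann model of how a designer might pick a proxy; none of these modeling choices or numbers come from the dissertation. The posterior concentrates on true utility functions under which the observed proxy would have produced good behavior in the development context, while keeping mass on functions the proxy may represent only incompletely.

```python
import numpy as np

# Candidate "true" weight vectors over three features (hypothesis space), with
# a uniform prior over them. All values here are illustrative.
candidates = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
prior = np.full(len(candidates), 1.0 / len(candidates))

# Feature counts of the behaviors available in the development environment.
dev_trajectories = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.2, 1.2, 0.0],
    [1.2, 0.0, 1.2],
    [0.0, 1.2, 1.2],
    [0.9, 0.9, 0.9],
])

proxies = candidates            # proxies the designer could have written down
observed_proxy_idx = 1          # the designer actually wrote [1, 1, 0]
beta = 2.0                      # assumed designer rationality

def dev_value(proxy_w, true_w):
    """True value of the development-time behavior that optimizes the proxy."""
    best = dev_trajectories[np.argmax(dev_trajectories @ proxy_w)]
    return best @ true_w

# Likelihood: designers are exponentially more likely to pick proxies whose
# optimal development-time behavior scores well under the true weights.
posterior = np.zeros(len(candidates))
for i, true_w in enumerate(candidates):
    scores = np.exp(beta * np.array([dev_value(p, true_w) for p in proxies]))
    posterior[i] = prior[i] * scores[observed_proxy_idx] / scores.sum()
posterior /= posterior.sum()

for w, p in zip(candidates, posterior):
    print(f"true weights {w} -> posterior {p:.2f}")
```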
Chapter 5 deals with the third part of the thesis. We introduce cooperative inverse reinforcement learning (CIRL), which formalizes the base case of assistance games. CIRL models dyadic value alignment between a human principal H and a robot assistant R, and this game-theoretic framework captures H’s incentive to be pedagogic. We show that pedagogical solutions to value alignment can be substantially more efficient than methods based on, e.g., imitation learning. Additionally, we provide theoretical results that support a family of efficient algorithms for CIRL, which adapt standard approaches for solving POMDPs to compute pedagogical equilibria.
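To fix the structure informally, the sketch below lays out the ingredients of such a dyadic assistance game as a Python container; the field names and types are my own shorthand rather than the dissertation's formal definition. Both players maximize the same reward, but only H observes the reward parameter theta, which is what lets R treat its decision problem as a POMDP over the pair (world state, theta).

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State, HumanAct, RobotAct, Theta = Any, Any, Any, Any

@dataclass
class AssistanceGame:
    """A two-player game with identical payoffs in which only the human
    observes the reward parameter theta (CIRL-style; names are illustrative)."""
    states: Sequence[State]
    human_actions: Sequence[HumanAct]
    robot_actions: Sequence[RobotAct]
    thetas: Sequence[Theta]                                           # possible reward parameters
    prior: Callable[[Theta], float]                                   # R's initial belief over theta
    transition: Callable[[State, HumanAct, RobotAct, State], float]   # P(s' | s, aH, aR)
    reward: Callable[[State, HumanAct, RobotAct, Theta], float]       # shared reward for both players
    gamma: float                                                      # discount factor

# Because theta is hidden from R but fixed, R's decision problem is a POMDP whose
# hidden state is (world state, theta); H's observed actions serve as the
# observations that drive R's belief update over theta.
```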
Finally, Chapter 6 considers the last component of the thesis: the need for robust solutions that can handle strategy variation on the part of H. We introduce a setting where R assists H in solving a multi-armed bandit. As in Section 3.1, H’s actions tell R which of the k different arms to pull. Unlike that setting, however, H does not know a priori which arm is optimal. We show that this setting admits efficient strategies in which H treats their actions as purely communicative. These communication solutions can achieve optimal learning performance, but they perform arbitrarily poorly if the encoding strategy used by H is misaligned with R’s decoding strategy.
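The following toy simulation is my own construction for intuition, not the protocol analyzed in Chapter 6; the arm means, exploration rate, and permutation codes are illustrative assumptions. It shows both halves of the claim: when H's encoding of "the arm I currently believe is best" matches R's decoding, the pair converges to the best arm, but a fixed mismatch between the two codes leaves R pulling a poor arm indefinitely, even though H's estimates are accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])     # arm reward means, unknown to both players
k, horizon, explore = len(true_means), 2000, 0.1

def run(encode, decode):
    """H signals an arm through `encode`; R pulls `decode[signal]`.
    Returns the average reward obtained over the horizon."""
    counts, sums, total = np.zeros(k), np.zeros(k), 0.0
    for _ in range(horizon):
        estimates = sums / np.maximum(counts, 1)
        # With small probability H explores; otherwise it signals its empirical best arm.
        target = rng.integers(k) if rng.random() < explore else int(np.argmax(estimates))
        signal = encode[target]            # H's action encodes the arm it wants pulled
        arm = decode[signal]               # R interprets that action with its own code
        reward = rng.normal(true_means[arm], 0.1)
        counts[arm] += 1                   # both players observe the pulled arm and its reward
        sums[arm] += reward
        total += reward
    return total / horizon

identity = np.arange(k)                    # matched encoder/decoder
shifted = (np.arange(k) + 1) % k           # R's decoder disagrees with H's encoder

print("matched codes   :", round(run(identity, identity), 2))
print("mismatched codes:", round(run(identity, shifted), 2))
```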
We conclude with a discussion of related work in Chapter 7 and proposals for future work in Chapter 8.