Typically when learning about what people want and don't want, we look to human action as evidence: what reward they specify, how they perform a task, or what preferences they express can all provide useful information about what an agent should do. This is essential in order to build AI systems that do what we intend them to do. However, existing methods require a lot of expensive human feedback in order to learn even simple tasks. This dissertation argues that there is an additional source of information that is rather helpful: the state of the world.
The key insight of this dissertation is that when a robot is deployed in an environment that humans have been acting in, the state of the environment is already optimized for what humans want, and is thus informative about human preferences.
We formalize this setting by assuming that a human H has been acting in an environment for some time, and a robot R observes the final state produced. From this final state, R must infer as much as possible about H's reward function. We analyze this problem formulation theoretically and show that it is particularly well suited to inferring aspects of the state that should not be changed -- exactly the aspects of the reward that H is likely to forget to specify. We develop an algorithm using dynamic programming for tabular environments, analogously to value iteration, and demonstrate its behavior on several simple environments. To scale to high-dimensional environments, we use function approximators judiciously to allow the various parts of our algorithm to be trained without needing to enumerate all possible states.
Of course, there is no point in learning about H's reward function unless we use it to guide R's decision-making. While we could have R simply optimize the inferred reward, this suffers from a "status quo bias": the inferred reward is likely to strongly prefer the observed state, since by assumption it is already optimized for H's preferences. To get R to make changes to the environment, we will usually need to integrate the inferred reward with other sources of preference information. In order to support such reward combination, we use a model in which R must maximize an unknown reward function known only to H. Learning from the state of the world arises as an instrumentally useful behavior in such a setting, and can serve to form a prior belief over the reward function that can then be updated after further interaction with H.