## Learning as a Sampling Problem

- Author(s): Stadie, Bradly C
- Advisor(s): Bartlett, Peter
- et al.

## Abstract

The past five years have seen rapid proliferation of work on deep learning: learning algorithms that utilize deep neural networks for nonlinear function approximation. Although this proliferation had its roots in supervised learning, it subsequently spread to numerous other learning problems including reinforcement learning, imitation learning, meta learning, and unsupervised learning. Today, deep learning enables a variety of previously unobtainable capabilities:

1. Computers can play complex video games from raw images

2. Unsupervised learning algorithms can generate photo-realistic bedroom images from scratch without a reference

3. Robots can learn by copying the behavior of other robots. This imitation is quite robust and does not falter even when the demonstrated behavior is complex, abstract, or sub-optimal.

4. Deep learning powers the world’s best translation and text-to-speech engines

5. Image classifiers can learn a new class from a single example (one-shot image classification)

Yet, in spite of myriad successes, any deep learning practitioner will quickly run into difficulties when applying many of these learning algorithms to a novel problem. Everything is hard and nothing works easily. This thesis was born out of the difficulties I experienced while working through many problems in the fields of meta learning, reinforcement learning, and imitation learning. It is an attempt to fix many frustrating gaps in the prior art.

The first problem we consider is the exploration vs. exploitation dilemma in high-dimensional control problems with an image input space. We provide a practical algorithm to overcome the dilemma in this setting. This algorithm uses a learned dynamics model to assess a transition’s novelty. The dynamics model has the benefit of being fast to train and generalizable. We show that using this learned dynamics model to incentivise exploration leads to massive gains on several difficult Atari games.
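As a rough illustration of the idea above, the following sketch scores a transition's novelty by the prediction error of a learned dynamics model and adds that score to the reward. It is a toy under stated assumptions, not the thesis's implementation: a tabular running-mean model stands in for the neural network dynamics model, and the names `RunningMeanDynamics`, `novelty_bonus`, `augmented_reward`, and the `scale` parameter are hypothetical.

```python
from collections import defaultdict


class RunningMeanDynamics:
    """Toy stand-in for a learned dynamics model (assumption, not the thesis's
    network): predicts the next state as the running mean of the next states
    previously observed after each (state, action) pair."""

    def __init__(self, state_dim):
        self.state_dim = state_dim
        # Unseen (state, action) pairs predict the zero vector.
        self.mean = defaultdict(lambda: [0.0] * state_dim)
        self.count = defaultdict(int)

    def predict(self, state, action):
        return list(self.mean[(state, action)])

    def update(self, state, action, next_state):
        key = (state, action)
        self.count[key] += 1
        n = self.count[key]
        # Incremental mean update: m <- m + (x - m) / n
        self.mean[key] = [m + (x - m) / n for m, x in zip(self.mean[key], next_state)]


def novelty_bonus(model, state, action, next_state, scale=1.0):
    """Squared prediction error of the model, scaled by a hypothetical factor."""
    predicted = model.predict(state, action)
    error = sum((p - x) ** 2 for p, x in zip(predicted, next_state))
    return scale * error


def augmented_reward(model, state, action, next_state, reward, scale=1.0):
    """Environment reward plus novelty bonus; the model is trained afterwards."""
    bonus = novelty_bonus(model, state, action, next_state, scale)
    model.update(state, action, next_state)
    return reward + bonus


if __name__ == "__main__":
    model = RunningMeanDynamics(state_dim=2)
    # States are tuples so they can be used as dictionary keys.
    first = augmented_reward(model, (0, 0), 1, (1.0, 0.0), reward=0.0)
    second = augmented_reward(model, (0, 0), 1, (1.0, 0.0), reward=0.0)
    print(first, second)
```

The bonus is large the first time a transition is seen and shrinks once the model predicts it well, which is the mechanism that rewards visiting poorly modeled parts of the state space.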

The second problem we consider is a good deal more technical, and deals largely with fixing certain mathematical dependencies in the computational graph of meta reinforcement learning algorithms. In particular, we show that policy-gradient-like algorithms for meta learning must take care to correctly compute the gradient of the meta learner with respect to the task-specific learners. We argue that fixing this dependency issue leads to better exploratory behavior in meta learned agents.

The third problem comes from the field of imitation learning. In imitation learning, agents typically imitate other identical agents. Moreover, it is always assumed that the agent’s perspective while learning is identical to the perspective of the agent it is trying to imitate. In other words, agents do not learn by watching other agents. Instead, they learn by watching an exact replica of themselves completing a task. This is a strong and impractical assumption. We remove it by introducing algorithms for third-person imitation. These algorithms allow agents to learn by watching different agents, and not just copies of themselves.

The final problem we consider comes from the field of causal inference. In causal inference, each experiment is typically treated as independent. All treatment effect estimation is thus done in a vacuum, without consideration of other relevant experiments. To rectify this shortcoming, we develop the idea of deep causal transfer learning. By modifying some ideas from transfer reinforcement learning, we are able to train neural networks that can rapidly learn new treatment effects and causal relationships.

All of these problems can be derived when one takes the perspective that learning is a sampling problem. That is, many learning problems amount to analyzing a sampling distribution over a state space. For reinforcement learning, we will see that the underlying data distribution we wish to optimize over is not stationary as in supervised learning. Instead, it is sampled directly from the policy we are optimizing over. Furthermore, many common methods of optimizing this policy rely heavily on our ability to sample from the policy and do not require, for instance, derivatives with respect to the true reward. For the meta learning algorithms we consider, we will see that the fact that we are optimizing over the data-generating process is an important consideration. Taking this consideration into account, we derive new meta learning gradients that account for the impact of task-specific sampling distributions on the meta sampling distribution. For imitation learning, we see that the problem is to sample data that matches some unknown expert distribution. Although we do not know the expert’s true sampling distribution, we do have access to samples from it. We can use these samples to guide the imitator towards sampling from the correct expert distribution. Finally, we would like to one day allow agents to sample over causal relationships in their environment. This is in contrast to the present-day reality that sampling is almost always considered over low-level states or hierarchical constructs like options. The above arguments relating learning and sampling can be used to derive all of the problems we consider, and in this thesis we make these derivations explicit. We have thus chosen to title the thesis ‘Learning as a Sampling Problem.’
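The remark that policy optimization can rely on sampling alone, without derivatives of the reward, can be made concrete with the standard score-function (likelihood-ratio) identity. The notation is an assumption introduced here, not taken from the abstract: $\pi_\theta$ is a policy with parameters $\theta$, $\tau$ is a trajectory sampled by running $\pi_\theta$, and $R(\tau)$ is its total reward.

$$
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
= \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\big]
\approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)\, \nabla_\theta \log \pi_\theta(\tau_i),
\qquad \tau_i \sim \pi_\theta .
$$

Only $\log \pi_\theta$ is differentiated; $R$ enters as a scalar weight on each sample. The expectation is taken under a distribution that itself depends on $\theta$, which is the sense in which the data distribution is not stationary.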