Robust Task Specification for Learning Systems
- Toyer, Sam
- Advisor(s): Russell, Stuart
Abstract
This dissertation considers how to evaluate and improve the robustness of AI systems in situations that are systematically different from those encountered during training. Specifically, we focus on test-time robustness for two ways of specifying tasks and two forms of generalization. The first part of this dissertation focuses on learning tasks from demonstrations via imitation, while the second focuses on specifying tasks for large language models using natural language instructions.
In the first part, we specifically consider the combinatorial and in-distribution generalization of imitation learning. Our first contribution is a benchmark that measures how well learned policies generalize along several axes of variation. The benchmark allows us to manipulate these axes independently to determine which invariances and equivariances a policy has. Using this benchmark, we show that some basic computer vision techniques (augmentation, egocentric views) improve imitative generalization, but more sophisticated representation learning techniques do not.
In the second part, we consider instruction-following language models and adversarial robustness, where a user is actively trying to provoke errors from the model. Here we contribute a large dataset of prompt injection attacks obtained from an online game, which we distill into a benchmark for language model robustness. We also consider a second type of adversarial attack called a jailbreak, and show that existing evaluations are insufficient to gauge the actual misuse potential of jailbreaking techniques. Thus we propose a new benchmark that identifies effective jailbreaks while correctly disregarding ineffective ones.
This dissertation proposes several evaluations for challenging problems where existing algorithms fail: imitation learning algorithms struggle to generalize when only a few demonstrations are available, and representation learning is not an easy fix. Likewise, the safeguards around large language models are easy for an adversary to subvert. These negative results point toward ways that AI systems could be made more robust in unexpected circumstances; we describe these opportunities for future work in Chapter 6.