Intelligent agents perform computations to help them decide how to act. But which computations will best inform these decisions? How long should an agent think before acting? Given the impracticality of performing all possible relevant computations, AI systems need effective strategies for metalevel control—that is, for choosing computations that contribute most to the (object-level) action selection problem.
This thesis presents a theoretical basis for metalevel control, situating the problem within the frameworks of Bayesian selection problems and Markov decision processes. We present fundamental results concerning the structure of these problems and the nature of their solutions. These results establish bounds on the number of computations performed by optimal policies and clarify how the presence of multiple candidate actions affects the choice of computation.
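For concreteness, one common way to cast metalevel control as a Markov decision process (the notation here is illustrative, not necessarily the thesis's own) takes the metalevel states to be belief states $s$ over the utilities $U_1, \dots, U_k$ of the object-level actions, and the metalevel actions to be the available computations $c_1, \dots, c_n$ together with a stopping action $\bot$, with rewards
\[
R(s, c_j) = -\mathrm{cost}(c_j),
\qquad
R(s, \bot) = \max_{1 \le i \le k} \mathbb{E}_s[U_i],
\]
so that an optimal metalevel policy trades the cost of further computation against the expected improvement in the eventual object-level choice.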
For a more empirical investigation of metalevel control, we apply reinforcement learning techniques to the problem of controlling Monte Carlo tree search. This requires learning a metalevel policy that chooses which computation to perform as a function of the internal state of the tree search. To represent a learnable class of efficient policies, we describe how Monte Carlo tree search can be implemented using pointed trees, a recursive data structure that allows efficient evaluation of recursive functions and their derivatives. We propose a concrete class of policies that includes UCT and AlphaGo as special cases, along with a method for providing metalevel shaping rewards. We show that, by initializing the metalevel policy to UCT, reinforcement learning can find metalevel policies that outperform UCT at smaller numbers of computations in the domain of computer Hex.
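As an illustrative sketch only (the representation and names below are assumptions for exposition, not the thesis's implementation), a pointed tree for Monte Carlo tree search can be pictured as nodes carrying the statistics a metalevel policy consumes, with the "point" being the node currently under consideration; statistics are updated incrementally along the point-to-root path, which is what makes recursive functions of the tree, such as UCT scores, cheap to re-evaluate. A UCT-style selection rule is then one particular policy over this structure:

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch, assuming a pointed tree is represented as nodes with parent
# links; the point is the node the search is currently focused on, and the
# statistics needed by a metalevel policy are updated only along the
# point-to-root path.

@dataclass
class Node:
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct_score(self, c: float = math.sqrt(2)) -> float:
        """UCT value of this node from its parent's perspective."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select(root: Node) -> Node:
    """Descend from the root, following the child with the best UCT score;
    the returned leaf becomes the new point of the pointed tree."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.uct_score())
    return node

def backup(point: Node, reward: float) -> None:
    """Propagate a simulation result from the point to the root,
    touching only the nodes on that path."""
    node = point
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

In the learned setting, the hand-coded `uct_score` above would be replaced by a parameterized scoring function whose parameters are initialized to reproduce UCT and then improved by reinforcement learning.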
These results demonstrate the benefits of a solid theoretical understanding of metalevel control, as well as the potential for using metalevel reinforcement learning to replace hand-crafted algorithm design.