Reinforcement learning (RL) is a powerful machine learning paradigm that studies the interaction of a single agent with an unknown environment. A plethora of applications fit into the RL framework; however, in many cases of interest, a team of agents needs to interact with the environment and with each other to achieve a common goal. This is the object of study of collaborative multi-agent RL (MARL).
Several challenges arise when considering collaborative MARL. One of these challenges is decentralization. In many cases, due to design constraints, it is undesirable or inconvenient to constantly relay data between agents and a centralized location; therefore, fully distributed solutions become preferable. The first part of this dissertation addresses the challenge of designing fully decentralized MARL algorithms. We consider two problems: policy evaluation and policy optimization. In the policy evaluation problem, the objective is to estimate the performance of a target team policy in a particular environment. This problem has been studied before for the case with streaming data; however, in most implementations the target policy is evaluated using a finite data set. For this case, existing algorithms guarantee convergence only at a sub-linear rate. In this dissertation we introduce Fast Diffusion for Policy Evaluation (FDPE), an algorithm that converges at a linear rate in the finite data set case. We then consider the policy optimization problem, where the objective is for all agents to learn an optimal team policy. This problem has also been studied recently; however, existing solutions are data-inefficient and converge to Nash equilibria (whose performance can be catastrophically bad) rather than to team-optimal policies. For this case we introduce the Diffusion for Team Policy Optimization (DTPO) algorithm. DTPO is more data-efficient than previous algorithms and does not converge to Nash equilibria. For both of these cases, we provide experimental studies that show the effectiveness of the proposed methods.
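To make the decentralized policy evaluation setting concrete, the following is a minimal sketch of a diffusion-style (adapt-then-combine) TD(0) scheme over a finite data set. It is not the FDPE or DTPO algorithm introduced above; the network topology, combination weights, and synthetic data are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of diffusion-style decentralized policy evaluation: each agent
# performs a local TD(0) "adapt" step on its own finite data set and then
# "combines" its estimate with those of its neighbors. Illustrative only; not
# the FDPE algorithm. Topology, weights, and data are hypothetical assumptions.

n_agents, n_states = 3, 5
gamma, step_size = 0.9, 0.1

# Doubly stochastic combination matrix over an assumed ring topology.
A = np.zeros((n_agents, n_agents))
for k in range(n_agents):
    A[k, k] = 0.5
    A[k, (k + 1) % n_agents] += 0.25
    A[k, (k - 1) % n_agents] += 0.25

# Each agent holds a finite set of transitions (s, r, s') collected under the
# target team policy (random synthetic data here, purely for illustration).
rng = np.random.default_rng(0)
data = [
    [(int(rng.integers(n_states)), float(rng.normal()), int(rng.integers(n_states)))
     for _ in range(50)]
    for _ in range(n_agents)
]

V = np.zeros((n_agents, n_states))  # one local value estimate per agent
for _ in range(200):                # sweep repeatedly over the finite data sets
    psi = V.copy()
    for k in range(n_agents):       # adapt: local TD(0) step on agent k's data
        for s, r, s_next in data[k]:
            td_error = r + gamma * V[k, s_next] - V[k, s]
            psi[k, s] += step_size * td_error
    V = A @ psi                     # combine: average with neighbors' estimates
```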
Another challenge that arises in collaborative MARL, and which is orthogonal to the decentralization problem, is that of scalability. The number of parameters that need to be estimated when full team policies are learned grows exponentially with the number of agents. Hence, algorithms that learn joint team policies quickly become intractable. A solution to this problem is for each agent to learn an individual policy such that the resulting joint team policy is optimal. This problem has been the object of much research lately. However, most solution methods are data-inefficient and often make unrealistic assumptions that greatly limit the applicability of these approaches. To address this problem we introduce Logical Team Q-learning (LTQL), an algorithm that learns factored policies in a data-efficient manner and is applicable to any cooperative MARL environment. We show that LTQL outperforms previous methods in a challenging predator-prey task.
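As a simple illustration of why factored policies matter (and not of LTQL itself), the sketch below contrasts the size of a joint team Q-table with that of per-agent factored tables and runs independent per-agent Q-learning updates on the factored tables. The toy environment, reward, and all parameter names are hypothetical assumptions.

```python
import numpy as np

# Illustrative sketch only: independent per-agent Q-tables as a factored
# alternative to a joint team Q-table. This is NOT the LTQL algorithm; it only
# illustrates why factored policies are needed. All names are hypothetical.

n_agents, n_states, n_actions = 4, 10, 5
gamma, step_size, eps = 0.95, 0.1, 0.1

# A joint Q-table over team actions needs n_states * n_actions**n_agents
# entries (exponential in the number of agents) ...
joint_entries = n_states * n_actions ** n_agents    # 10 * 5**4 = 6250

# ... while factored per-agent tables need only n_agents * n_states * n_actions.
factored_entries = n_agents * n_states * n_actions  # 4 * 10 * 5 = 200

Q = np.zeros((n_agents, n_states, n_actions))       # one table per agent

def step(state, actions):
    """Hypothetical environment: team reward when all agents agree."""
    return float(len(set(actions)) == 1), (state + 1) % n_states

rng = np.random.default_rng(0)
state = 0
for _ in range(1000):
    # Each agent acts epsilon-greedily on its own factored Q-table.
    actions = [
        int(rng.integers(n_actions)) if rng.random() < eps
        else int(np.argmax(Q[k, state]))
        for k in range(n_agents)
    ]
    reward, next_state = step(state, actions)
    for k in range(n_agents):                       # independent Q-learning update
        target = reward + gamma * Q[k, next_state].max()
        Q[k, state, actions[k]] += step_size * (target - Q[k, state, actions[k]])
    state = next_state
```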
Another challenge is that of efficient exploration. This is a problem in both the single-agent and multi-agent settings, although in MARL it becomes more severe due to the larger state-action space. The challenge of deriving policies that explore the state space efficiently has been addressed in many recent works. However, most of these approaches rely on heuristics and, more importantly, they treat the problem of exploring the state space separately from that of learning an optimal policy (even though the two are related, since the state space is explored precisely to collect the data needed to learn an optimal policy). To address this challenge, we introduce the Information Seeking Learner (ISL), an algorithm that displays state-of-the-art performance on difficult exploration benchmarks. The value of our work on exploration is that we take a fundamentally different approach from previous works: we consider the problems of exploring the state space and learning an optimal policy jointly. The main insight of our approach is that in RL, obtaining point estimates of the quantities of interest is not sufficient; confidence-bound estimates are also necessary.
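The following sketch illustrates the general idea of exploration driven by confidence bounds rather than point estimates alone, using a simple count-based optimistic bonus in a tabular setting. It is not the ISL algorithm; the bonus form, the chain environment, and all parameters are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative sketch only: count-based upper-confidence action selection in a
# tabular setting, showing how confidence-bound estimates (not just point
# estimates) can drive exploration. This is NOT the ISL algorithm; the bonus
# form, environment, and parameter names are hypothetical assumptions.

n_states, n_actions = 10, 4
gamma, step_size, bonus_scale = 0.95, 0.1, 1.0

Q = np.zeros((n_states, n_actions))   # point estimates of action values
N = np.ones((n_states, n_actions))    # visit counts (init at 1 to avoid /0)

def step(state, action):
    """Hypothetical chain environment: reward only upon reaching the last state."""
    next_state = min(state + 1, n_states - 1) if action == 0 else 0
    return float(next_state == n_states - 1), next_state

state = 0
for _ in range(5000):
    # Act greedily on an optimistic estimate: point estimate + confidence bonus.
    ucb = Q[state] + bonus_scale / np.sqrt(N[state])
    action = int(np.argmax(ucb))
    reward, next_state = step(state, action)
    N[state, action] += 1
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += step_size * (target - Q[state, action])
    state = next_state
```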