eScholarship
Open Access Publications from the University of California

UC Berkeley Electronic Theses and Dissertations

Mean-Field Cooperative Multi-agent Reinforcement Learning: Modelling, Theory, and Algorithms

Abstract

In numerous stochastic systems involving a large number of agents, the model parameters and dynamics are typically not known beforehand. Learning algorithms are therefore crucial for these agents to improve their decision-making while engaging with the partially unknown system and interacting with one another. In this setting, multi-agent reinforcement learning (MARL) has enjoyed substantial success in analyzing otherwise challenging games arising from fields including autonomous driving, supply chains, manufacturing, e-commerce, and finance. Despite this empirical success, MARL suffers from the curse of dimensionality: for stochastic dynamics, the sample complexity of existing algorithms grows exponentially with the total number of agents $N$ in the system. This PhD thesis focuses on advancing the theoretical understanding of, and developing novel efficient algorithms with provable performance guarantees for, large-population cooperative games via MARL and mean-field approximation.

The mean-field approximation of cooperative games in the regime with a large number of homogeneous agents is also known as mean-field control (MFC). It is therefore both natural and important to consider the learning problem for MFCs. The first part of this dissertation investigates the learning framework of MFCs and establishes the corresponding dynamic programming principle (DPP). The DPP is fundamental for control and optimization, including Markov decision processes (MDPs) and reinforcement learning (RL). In the learning framework of MFCs, however, the DPP had not been rigorously established, despite its critical importance for algorithm design. We first present a simple example of MFCs with learning in which the DPP fails for a mis-specified Q-function, and then propose the correct form of the Q-function, defined on an appropriate space, for MFCs with learning. This form of Q-function differs from the classical one and is called the IQ-function. Compared with the classical Q-function in the single-agent RL literature, MFCs with learning can be viewed as lifting classical RL by replacing the state-action space with its probability distribution space. This identification of the IQ-function enables us to establish precisely the DPP in the learning framework of MFCs. The time consistency of the IQ-function is further illustrated through numerical experiments.
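
As a hedged illustration of this lifting (the symbols below are assumed notation for this summary, not necessarily the thesis's): let $\mu$ denote the population state distribution, $h$ a distribution-valued (lifted) action, $\tilde r(\mu, h)$ the aggregated reward, and $\Phi(\mu, h)$ the mean-field transition of $\mu$ under $h$. The IQ-function then plays the role of a classical Q-function on the lifted space, and the DPP reads, schematically,

$$ Q(\mu, h) \,=\, \tilde r(\mu, h) + \gamma \sup_{h'} Q\big(\Phi(\mu, h), h'\big), \qquad V(\mu) \,=\, \sup_{h} Q(\mu, h). $$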

The second part of this dissertation addresses the curse of dimensionality in MARL via MFC approximations and develops sample-efficient learning algorithms. The mathematical framework for approximating cooperative MARL by MFC is rigorously established, with an approximation error of $\mathcal{O}(\frac{1}{\sqrt{N}})$. Furthermore, based on the DPP for both the value function and the Q-function of learning MFC, it introduces a model-free kernel-based Q-learning algorithm (MFC-K-Q) with a linear convergence rate, the first of its kind in the MARL literature. Empirical studies confirm the effectiveness of MFC-K-Q, particularly for large-scale problems.
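
For a rough, non-authoritative illustration of the kernel-based idea only (this is not the thesis's pseudocode; the function names, the Gaussian kernel, the simplex discretization, and the assumed mean-field simulator `env_step(mu, a)` are all choices made for this sketch), one can run Q-learning on a finite net of the probability simplex and interpolate off-grid distributions with kernel weights:

    import numpy as np
    from itertools import combinations

    def grid_simplex(n_states, resolution):
        """Finite epsilon-net of the probability simplex (assumed discretization)."""
        pts = []
        for cuts in combinations(range(resolution + n_states - 1), n_states - 1):
            counts = np.diff(np.array([-1, *cuts, resolution + n_states - 1])) - 1
            pts.append(counts / resolution)
        return np.array(pts)

    def kernel_weights(mu, grid, bandwidth=0.2):
        """Gaussian-kernel weights of a distribution mu against the grid points."""
        dist = np.linalg.norm(grid - mu, axis=1)
        w = np.exp(-(dist / bandwidth) ** 2)
        return w / w.sum()

    def kernel_q_learning(env_step, grid, n_actions, episodes=200, horizon=50,
                          gamma=0.9, lr=0.5, eps=0.1, bandwidth=0.2, seed=0):
        """Q-learning over the lifted (distributional) state, with kernel
        interpolation for off-grid distributions. `env_step(mu, a)` is an
        assumed simulator returning (next distribution, aggregated reward)."""
        Q = np.zeros((len(grid), n_actions))
        rng = np.random.default_rng(seed)
        for _ in range(episodes):
            mu = grid[rng.integers(len(grid))]
            for _ in range(horizon):
                w = kernel_weights(mu, grid, bandwidth)
                q_mu = w @ Q                                  # interpolated Q(mu, .)
                a = rng.integers(n_actions) if rng.random() < eps else int(q_mu.argmax())
                mu_next, reward = env_step(mu, a)
                target = reward + gamma * (kernel_weights(mu_next, grid, bandwidth) @ Q).max()
                Q[:, a] += lr * w * (target - q_mu[a])        # kernel-weighted update
                mu = mu_next
        return Q

Here the "actions" stand for a finite set of lifted (distribution-valued) controls; the convergence-rate and complexity guarantees are those stated in the thesis, not properties of this toy sketch.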

Another approach to reducing the sample complexity of cooperative MARL and learning MFC is to design efficient decentralized learning algorithms, in which each individual agent requires only local information about the entire system. In particular, little is known theoretically for decentralized MARL with a network of states. The third study proposes a framework of localized training and decentralized execution for cooperative MARL with a network of states and mean-field approximation, in order to study MARL systems such as self-driving vehicles, ride-sharing, and data and traffic routing. Localized training means collecting local information from agents' neighboring states for training; decentralized execution means executing the learned decentralized policies, which depend only on agents' current states. The theoretical analysis consists of three key components: first, a mean-field reformulation of the original MARL system as a networked MDP with teams of agents, which enables local updates of the associated team Q-function; second, a DPP for the mean-field type of Q-function of each team on the probability measure space; and third, an exponential decay property of the Q-function, which facilitates its sample-efficient approximation with controllable error. The analysis leads to a neural-network-based decentralized actor-critic algorithm, in which the actor-critic approach is coupled with over-parameterized neural networks. Convergence and sample complexity of the algorithm are established and shown to be scalable in the numbers of agents and states.
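
To make the truncation idea concrete, here is a hedged toy sketch (the names `k_hop_neighborhood` and `truncated_team_q`, the graph encoding, and the placeholder linear critic are illustrative assumptions, not the thesis's algorithm): a team's Q-value is approximated using only the local distributions on its $\kappa$-hop neighborhood of the state network, which is what the exponential decay property justifies.

    import numpy as np

    def k_hop_neighborhood(adj, state, kappa):
        """States within graph distance kappa of `state` on the state network."""
        frontier, seen = {state}, {state}
        for _ in range(kappa):
            frontier = {j for i in frontier
                          for j in np.flatnonzero(adj[i])} - seen
            seen |= frontier
        return sorted(seen)

    def truncated_team_q(critic, mu_by_state, adj, state, kappa):
        """Approximate a team's Q-value from kappa-hop local information only.
        The exponential decay property suggests the truncation error shrinks
        geometrically in kappa (see the thesis for the precise statement)."""
        hood = k_hop_neighborhood(adj, state, kappa)
        features = np.concatenate([mu_by_state[s] for s in hood])
        return critic(features)

    # Toy usage: a path graph on 4 states, uniform local distributions,
    # and a placeholder linear critic standing in for a neural network.
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
    mu_by_state = {s: np.ones(3) / 3 for s in range(4)}
    critic = lambda x: float(np.arange(len(x)) @ x)
    q_hat = truncated_team_q(critic, mu_by_state, adj, state=1, kappa=1)

In the thesis the critic is an over-parameterized neural network trained within an actor-critic loop; the sketch only shows how locality of the inputs is enforced.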
