Mean-Field Cooperative Multi-Agent Reinforcement Learning: Modelling, Theory, and Algorithms
Gu, Haotian
Advisor(s): Guo, Xin; Rezakhanlou, Fraydoun
Abstract
In numerous stochastic systems involving a large number of agents, the model parameters and dynamics are typically not known beforehand. As a result, learning algorithms are crucial for these agents to improve their decision-making while engaging with the partially unknown system and interacting with other agents. Multi-agent reinforcement learning (MARL) has enjoyed substantial success in analyzing the otherwise challenging games arising in numerous fields, including autonomous driving, supply chains, manufacturing, e-commerce, and finance. Despite this empirical success, MARL suffers from the curse of dimensionality: the sample complexity of existing algorithms for stochastic dynamics grows exponentially with the total number of agents $N$ in the system. This PhD thesis focuses on advancing the theoretical understanding of, and developing novel efficient algorithms with provable performance guarantees for, solving large-population cooperative games using MARL and mean-field approximation.
The mean-field approximation of cooperative games in the regime of a large number of homogeneous agents is also known as mean-field control (MFC). It is therefore both natural and important to consider the learning problem for MFCs. The first part of this dissertation investigates the learning framework of MFCs and establishes the corresponding dynamic programming principle (DPP). The DPP is fundamental for control and optimization, including Markov decision problems (MDPs) and reinforcement learning (RL). In the learning framework of MFCs, however, the DPP had not been rigorously established, despite its critical importance for algorithm design. We first present a simple example of MFC with learning in which the DPP fails for a misspecified Q-function, and then propose the correct form of the Q-function, in an appropriate space, for MFCs with learning. This form differs from the classical one and is called the IQ-function. Compared with the classical Q-function in the single-agent RL literature, MFC with learning can be viewed as lifting classical RL by replacing the state-action space with its probability distribution space. This identification of the IQ-function enables us to establish precisely the DPP in the learning framework of MFCs. The time consistency of the IQ-function is further illustrated through numerical experiments.
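The lifting described above, from the state-action space to its distribution space, can be illustrated with a minimal sketch. The example below is entirely hypothetical (a two-state population with hand-picked dynamics, rewards, and discretization, not the thesis's model or algorithm): tabular Q-learning is run on a discretized simplex of population distributions, so the "state" of the learned Q-function is the empirical distribution itself, in the spirit of the IQ-function.

```python
import numpy as np

# Hypothetical two-state mean-field control problem. The population state is
# mu = fraction of agents in state 1; an "action" a moves a fraction a of the
# remaining agents into state 1. All names and constants here are illustrative.

np.random.seed(0)

N_BINS = 11                 # discretization of the simplex: mu in {0, 0.1, ..., 1}
ACTIONS = [0.0, 0.5, 1.0]   # candidate flip fractions (assumed)
GAMMA = 0.9                 # discount factor
ALPHA = 0.1                 # learning rate
EPS = 0.2                   # exploration rate

def step(mu, a):
    """Deterministic mean-field dynamics plus a reward for a balanced population."""
    next_mu = (1 - a) * mu + a
    reward = -abs(next_mu - 0.5)
    return next_mu, reward

def bin_of(mu):
    return int(round(mu * (N_BINS - 1)))

# The Q-table is indexed by (distribution bin, action), not by an individual
# agent's state -- this is the "lifted" state space.
Q = np.zeros((N_BINS, len(ACTIONS)))

for episode in range(2000):
    mu = np.random.rand()                      # random initial distribution
    for t in range(20):
        b = bin_of(mu)
        if np.random.rand() < EPS:             # epsilon-greedy exploration
            a_idx = np.random.randint(len(ACTIONS))
        else:
            a_idx = int(np.argmax(Q[b]))
        next_mu, r = step(mu, ACTIONS[a_idx])
        nb = bin_of(next_mu)
        Q[b, a_idx] += ALPHA * (r + GAMMA * Q[nb].max() - Q[b, a_idx])
        mu = next_mu

# From mu = 0, moving half the population (a = 0.5) reaches the balanced
# distribution mu = 0.5, so the learned greedy action there should be 0.5.
best = ACTIONS[int(np.argmax(Q[bin_of(0.0)]))]
print(best)
```

The key design point is that everything an individual Q-function would index by a single agent's state is indexed here by the population distribution; the classical single-agent recursion then carries over bin by bin, which is the structure the DPP for the IQ-function makes rigorous.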
The second part of this dissertation addresses the curse of dimensionality in MARL via MFC approximations and develops sample-efficient learning algorithms. The mathematical framework for approximating cooperative MARL by MFC is rigorously established, with an approximation error of $\mathcal{O}(\frac{1}{\sqrt{N}})$. Furthermore, based on the DPP for both the value function and the Q-function of learning MFC, it introduces a model-free kernel-based Q-learning algorithm (MFC-K-Q) with a linear convergence rate, the first of its kind in the MARL literature. Empirical studies confirm the effectiveness of MFC-K-Q, particularly for large-scale problems.
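The $\mathcal{O}(\frac{1}{\sqrt{N}})$ rate has a simple probabilistic flavor that can be checked numerically. The sketch below is a toy illustration, not the thesis's approximation argument: for a single transition in which each of $N$ agents independently lands in state 1 with probability $p$, the empirical fraction deviates from its mean-field limit $p$ at the central-limit rate, so the average gap shrinks by a factor of about 10 when $N$ grows by a factor of 100.

```python
import numpy as np

# Toy check of the 1/sqrt(N) scaling (illustrative only). Each of N agents
# independently moves to state 1 with probability p, so the count in state 1
# is Binomial(N, p) and the empirical fraction concentrates around p.

rng = np.random.default_rng(1)
p = 0.3

def empirical_gap(N, trials=4000):
    # Average |mu_N - mu| over many simulated populations of size N,
    # where mu_N is the empirical fraction and mu = p is the limit.
    emp = rng.binomial(N, p, size=trials) / N
    return np.abs(emp - p).mean()

gap_small = empirical_gap(100)
gap_large = empirical_gap(10_000)
ratio = gap_small / gap_large
print(ratio)   # close to sqrt(10_000 / 100) = 10
```

This only illustrates one-step concentration of the empirical distribution; the thesis's $\mathcal{O}(\frac{1}{\sqrt{N}})$ bound is for the full control problem over time, which requires propagating such estimates through the dynamics and the value function.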
The other approach to reducing the sample complexity of cooperative MARL and learning MFC is to design efficient decentralized learning algorithms, in which each individual agent requires only local information about the entire system. In particular, little is known theoretically about decentralized MARL with a network of states. The third study proposes a framework of localized training and decentralized execution for cooperative MARL with a network of states and mean-field approximation, in order to study MARL systems such as self-driving vehicles, ride-sharing, and data and traffic routing. Localized training collects local information in agents' neighboring states for training; decentralized execution runs the learned decentralized policies, which depend only on agents' current states. The theoretical analysis consists of three key components: the first establishes a mean-field reformulation of the original MARL system as a networked MDP with teams of agents, enabling local updates of the associated team Q-function; the second develops the DPP for the mean-field type of Q-function of each team on the probability measure space; and the third analyzes the exponential decay property of the Q-function, facilitating its sample-efficient approximation with controllable error. The analysis leads to a neural-network-based algorithm (DEC-AC), in which the actor-critic approach is coupled with over-parameterized neural networks. Convergence and sample complexity of the algorithm are established and shown to be scalable with respect to the number of agents and states.
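The exponential decay property that underpins localized training can be seen in a minimal example. The sketch below is hypothetical and much simpler than the networked MDP in the thesis: on a chain of states under a fixed "move right" policy, a reward perturbation at one node changes the value at a node $d$ steps away by exactly $\gamma^{d}$, so truncating value estimates to a local neighborhood of radius $\kappa$ incurs an error of order $\gamma^{\kappa}$.

```python
import numpy as np

# Illustrative-only chain MDP: N nodes, fixed policy "move one node right"
# (the last node self-loops). Values satisfy V(i) = r(i) + GAMMA * V(i+1),
# so the influence of a reward change at node k on node i decays as
# GAMMA ** (k - i) -- the exponential decay that justifies local training.

GAMMA = 0.8
N = 20

def values(reward):
    """Policy evaluation by fixed-point iteration for the move-right policy."""
    V = np.zeros(N)
    for _ in range(500):                         # iterate to convergence
        V = reward + GAMMA * np.append(V[1:], V[-1])
    return V

base_reward = np.zeros(N)
pert_reward = base_reward.copy()
pert_reward[10] += 1.0                           # perturb a distant node's reward

diff = np.abs(values(pert_reward) - values(base_reward))
print(diff[0], GAMMA ** 10)                      # effect at node 0 is gamma^10
```

In the thesis setting the decay is in the graph distance on the network of states rather than along a chain, but the consequence is the same: the team Q-function can be approximated from information in a bounded neighborhood, with an error that is controllable in the neighborhood radius.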