Learning in Mean-Field Games and Continuous-Time Stochastic Control Problems

Abstract

In recent years, there has been an ever-increasing demand for building reliable and versatile agents in applications arising from numerous fields, including autonomous driving, supply chains, manufacturing, e-commerce, and finance. To meet these challenging demands, research in decision-making systems has drawn upon a wide range of tools from applied probability, reinforcement learning (RL), stochastic control, and game theory. This dissertation focuses on developing new methodologies and efficient algorithms with provable performance guarantees for complex environments such as large-population competitions and continuous-time systems.

The first part of this dissertation focuses on designing and analyzing RL algorithms for large-population games. Large-population games appear in many real-world problems; examples include massive multiplayer online role-playing games, high-frequency trading, and the sharing economy. However, such problems generally become increasingly difficult to solve as the number of players grows. Mean-field games (MFGs) provide an ingenious and tractable aggregation approach to approximate the otherwise challenging $N$-player stochastic games. In Chapter 1, we present a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash equilibrium of this GMFG, and demonstrates that naively combining reinforcement learning with the fixed-point approach of classical MFGs yields unstable algorithms. It then proposes value-based and policy-based reinforcement learning algorithms (GMF-V and GMF-P, respectively) with smoothed policies, together with an analysis of their convergence properties and computational complexities. Experiments on an equilibrium product-pricing problem demonstrate that GMF-V-Q and GMF-P-TRPO, two specific instantiations of GMF-V and GMF-P with Q-learning and TRPO, respectively, are both efficient and robust in the GMFG setting. Moreover, their performance is superior in convergence speed, accuracy, and stability when compared with existing algorithms for multi-agent reinforcement learning in the $N$-player setting.
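
To make the fixed-point structure concrete, the following is a minimal sketch in the spirit of GMF-V with Q-learning (GMF-V-Q): it alternates between (i) Q-learning on the Markov decision process induced by a frozen population distribution, (ii) softmax smoothing of the resulting policy, and (iii) updating the population distribution under the smoothed policy. The simulator interface reward(s, a, mu) and transition(s, a, mu), the tabular state and action spaces, and all hyperparameters are illustrative assumptions, not the dissertation's exact algorithm.

```python
import numpy as np

def gmf_v_q_sketch(n_states, n_actions, reward, transition, n_outer=50,
                   n_q_steps=5000, gamma=0.9, temperature=0.1, lr=0.1, seed=0):
    """Illustrative fixed-point iteration with a softmax-smoothed Q-learning step,
    in the spirit of GMF-V-Q; the simulator calls reward(s, a, mu) and
    transition(s, a, mu) -> next state are assumptions made for exposition."""
    rng = np.random.default_rng(seed)
    mu = np.full(n_states, 1.0 / n_states)           # initial population distribution
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_outer):
        # (i) Q-learning on the MDP induced by the frozen mean field mu.
        Q = np.zeros((n_states, n_actions))
        s = rng.integers(n_states)
        for _ in range(n_q_steps):
            a = rng.integers(n_actions)              # purely exploratory behaviour policy
            s_next = transition(s, a, mu)
            Q[s, a] += lr * (reward(s, a, mu) + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
        # (ii) Smooth the greedy policy with a softmax (Boltzmann) operator,
        #      which stabilises the naive fixed-point iteration.
        logits = Q / temperature
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
        # (iii) Update the population distribution induced by the smoothed policy
        #       via a Monte Carlo rollout of a representative agent.
        visits = np.zeros(n_states)
        s = rng.integers(n_states)
        for _ in range(2000):
            a = rng.choice(n_actions, p=policy[s])
            s = transition(s, a, mu)
            visits[s] += 1
        mu = visits / visits.sum()
    return policy, mu
```

The softmax temperature plays the role of the policy smoothing discussed above: as it decreases, the iteration approaches the (unstable) greedy fixed-point update.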

The second part of this dissertation focuses on designing and analyzing RL algorithms for continuous-time stochastic dynamical systems. As most physical systems in science and engineering evolve continuously in time, many real-world control tasks, such as those in aerospace, the automotive industry, and robotics, are naturally formulated in terms of continuous-time dynamical systems. Nevertheless, mainstream RL algorithms have been designed for discrete-time systems, even though they are widely applied to physical tasks in continuous time. Continuous-time RL algorithms have also been developed over the past decades, but the theoretical guarantees of these works are limited to asymptotic convergence, and non-asymptotic guarantees remain unknown. In Chapter 2, we take the first step towards designing algorithms with non-asymptotic guarantees for solving finite-time-horizon continuous-time linear-quadratic (LQ) RL problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of magnitude $O((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two components: a perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and a parameter estimation error analysis, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm. In Chapter 3, we extend the results beyond linear-quadratic problems to the case where an unknown linear jump-diffusion process is controlled subject to non-smooth convex costs. We show that the associated linear-convex (LC) control problems admit Lipschitz continuous optimal feedback controls and further prove the Lipschitz stability of these feedback controls; the analysis relies on a stability analysis of the associated forward-backward stochastic differential equation. We then propose a least-squares algorithm that achieves a regret of order $O(\sqrt{N\ln N})$ on linear-convex learning problems with jumps, where $N$ is the number of learning episodes; the analysis leverages the Lipschitz stability of feedback controls and concentration properties of sub-Weibull random variables. Numerical experiments confirm the convergence and robustness of the proposed algorithm.
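
As a rough illustration of the discrete-observation variant, the sketch below performs a least-squares regression estimating unknown state and control coefficients $(A, B)$ of a linear state equation of the form $dX_t = (A X_t + B u_t)\,dt + dW_t$ from states sampled on a time grid under piecewise constant controls. The array shapes, the regression on normalized increments, and the function name estimate_lq_drift are assumptions made for exposition, not the dissertation's implementation.

```python
import numpy as np

def estimate_lq_drift(X, U, dt):
    """Least-squares estimate of the unknown drift coefficients (A, B) in
    dX_t = (A X_t + B u_t) dt + dW_t from discretely observed states and
    piecewise constant controls.

    X : array of shape (M, K+1, d)  observed states, M episodes, K+1 grid points
    U : array of shape (M, K, p)    piecewise constant controls on each subinterval
    dt: time stepsize of the observation grid

    Returns (A_hat, B_hat).  Illustrative sketch only, not the exact algorithm.
    """
    M, K1, d = X.shape
    K = K1 - 1
    p = U.shape[2]
    # Stack regressors z_k = (x_k, u_k) and responses (x_{k+1} - x_k) / dt.
    Z = np.concatenate([X[:, :-1, :], U], axis=2).reshape(M * K, d + p)
    Y = ((X[:, 1:, :] - X[:, :-1, :]) / dt).reshape(M * K, d)
    # Solve min_Theta ||Y - Z Theta||_F^2, where Theta stacks A^T over B^T.
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # shape (d + p, d)
    A_hat, B_hat = Theta[:d, :].T, Theta[d:, :].T
    return A_hat, B_hat
```

In an episodic scheme of this kind, the estimated coefficients would be re-fitted after each batch of episodes and plugged into the associated Riccati equation (or, in the linear-convex case, the feedback map) to form the next exploration policy; the additional regret term mentioned above reflects the discretization error introduced by the stepsize dt.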
