Reinforcement Learning in Structured and Partially Observable Environments
Sequentially making-decision abounds in real-world problems ranging from robots needing to interact with humans to companies aiming to provide reasonable services to their customers. It is as diverse as self-driving cars, health-care, agriculture, robotics, manufacturing, drug discovery, and aerospace. Reinforcement Learning (RL), as the study of sequential decision-making under uncertainty, represents a core aspect challenges in real-world applications.
While most of the practical application of interests in RL are high dimensions, we study RL problems from theory to practice in high dimensional, structured, and partially observable settings. We show how statistically develop efficient RL algorithm for a variety of RL problems, from recommendation systems to robotics and games. We theoretically study these problems from their first principles to provide RL agents which efficiently interact with their surrounding environment and learn the desired behavior while minimizing their regrets.
We study linear bandit problems where we propose Projected Stochastic Linear Bandit (PSLB), upper confidence bound based algorithm in linear bandit which exploit the intrinsic structure of the decision-making problem to significantly enhance the performance of RL agents.
We study the problem of RL in Markov Decision Process (MDP) where we propose the first sample efficient model-free algorithm for the general continuous state and action space MDPs. We further investigate safe RL setting and introduce a safe RL algorithm to avoid catastrophic mistakes that can be made by an RL agent. We extensively study tree-based methods, a well-popularized method in RL which is also the core to Alpha-Go, a technique to beat the masters of board games such as Go game.
We extend our study to partially observable environments, such as partially observable Markov decision processes (POMDP) where we propose the first regret analysis for the class of memoryless policies. We continue this study to a class of problems known as rich observable Markov decision processes (ROMPD) and propose the first regret bound with no dependency in the ambient dimension in the dominating terms.
We empirically study the significance of all these theoretically guaranteed methods and show their value in practice.