Reward discounting has become an indispensable ingredient in designing practical reinforcement learning (RL) algorithms. However, standard notions of regret in theoretical RL are not capable of capturing the essence of reward discounting and, as such, fail to serve as optimality criteria for analyzing practical RL algorithms. Three questions naturally arise:
(Q1) Can we have a different notion of regret that encapsulates the idea of reward discounting?
(Q2) Can we design RL algorithms, potentially with reward discounting, that can be analyzed under this different notion?
(Q3) Can we make these algorithms tackle non-linear problems effectively and efficiently so that they can compete with widely-used RL algorithms on standard benchmarks?
We address these three questions in this dissertation. To answer (Q1), we introduce a new notion of regret, named $\gamma$-regret, to capture the concept of regret in discounted RL. The parameter $\gamma$ in $\gamma$-regret serves as an alternative to the horizon or the diameter in traditional RL analysis. The definition of $\gamma$-regret not only captures the idea of reward discounting in prevalent RL algorithms but also enables rigorous theoretical analysis of these practices. It is also deeply connected with other regret-related notions in the existing RL literature.
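As a rough sketch of the idea (the precise definition, along with the exact notation, is given in the main text), one natural way to formalize such a notion is to accumulate, over the $T$ interaction steps, the gap between the optimal $\gamma$-discounted value and the $\gamma$-discounted value of the policy being executed:
\[
\mathrm{Regret}_{\gamma}(T) \;=\; \sum_{t=1}^{T}\left( V^{*}_{\gamma}(s_t) - V^{\pi_t}_{\gamma}(s_t) \right),
\]
where $s_t$ denotes the state visited at step $t$, $\pi_t$ the policy followed by the agent at step $t$, and $V^{*}_{\gamma}$ and $V^{\pi_t}_{\gamma}$ the optimal and policy-specific $\gamma$-discounted value functions; these symbols are illustrative rather than the notation used later in the dissertation.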
Under the $\gamma$-regret framework, both algorithm design and theoretical analysis become more challenging and require innovation. Nonetheless, we make significant progress toward answering (Q2). We first derive a lower bound of $\Omega\left(\sqrt{T/(1 - \gamma)}\right)$ on the $\gamma$-regret in the tabular setting, where $T$ is the total number of interactions. We then introduce two algorithmic instantiations of the $Q$-learning paradigm, tabular double Q-learning and kernelized Q-learning, both of which are amenable to theoretical analysis under the $\gamma$-regret framework. In particular, we obtain upper bounds of $\tilde{O}\left(\sqrt{T / (1 - \gamma)^3}\right)$ and $\tilde{O}\left(\sqrt{T / (1 - \gamma)^5}\right)$, respectively, on the $\gamma$-regret of these two algorithms under various settings. Notably, tabular double Q-learning is the first RL algorithm shown to have near-optimal $\gamma$-regret, and since its introduction, many more RL algorithms have been designed and analyzed under the $\gamma$-regret framework.
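To make the first of these algorithmic objects concrete, the snippet below is a minimal, illustrative sketch of a generic tabular double Q-learning update: two Q-tables are maintained, and at each step the table being updated selects the greedy next action while the other table evaluates it. The exploration strategy, step-size schedule, and the specific modifications analyzed in this dissertation are omitted, and all identifiers (double_q_update, n_states, alpha, and so on) are hypothetical names introduced only for this sketch.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, gamma, alpha, rng):
    """Generic tabular double Q-learning update (illustrative sketch only).

    With probability 1/2 the roles of the two tables are swapped, so that the
    table being updated selects the greedy next action while the other table
    evaluates it; this decoupling is what distinguishes double Q-learning
    from ordinary Q-learning.
    """
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1                      # swap which table gets updated
    a_star = int(np.argmax(Q1[s_next]))      # greedy action under the updated table
    target = r + gamma * Q2[s_next, a_star]  # ...but evaluated with the other table
    Q1[s, a] += alpha * (target - Q1[s, a])  # in-place temporal-difference step

# Hypothetical usage inside an interaction loop:
rng = np.random.default_rng(0)
n_states, n_actions = 10, 3                  # illustrative sizes
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))
double_q_update(Q1, Q2, s=0, a=1, r=0.0, s_next=2, gamma=0.99, alpha=0.1, rng=rng)
```

A complete agent would additionally specify an exploration rule and a step-size schedule, both of which this sketch deliberately leaves out.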
We also take a significant step toward answering (Q3). Specifically, we complement our analysis of kernelized Q-learning with experiments on classic control tasks; remarkably, kernelized Q-learning is the first RL algorithm that can nearly solve the MountainCar environment in as few as one thousand steps.