Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model
Abstract
In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma\in(0,1)$, and rewards in range $[0, 1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where \emph{both} the time spent and the number of samples taken are upper bounded by \[ O\left[\frac{SA}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{SA}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right] ~. \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
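To make the access model and the bound concrete, below is a minimal Python sketch of a generative sampling oracle together with the sample budget implied by the bound above. This is an illustration, not the paper's algorithm: the `GenerativeModel` class, the explicit transition tensor `P`, and the constant `c` are assumptions introduced here; the paper assumes only oracle access to the transition function, and the constant hidden inside the $O(\cdot)$ is not specified.

```python
import math

import numpy as np


class GenerativeModel:
    """Hypothetical generative sampling oracle for a DMDP.

    Given a state-action pair, draws the next state from the transition
    distribution in O(1) time. The dense transition tensor P (shape S x A x S)
    is an illustrative stand-in; the paper assumes only oracle access.
    """

    def __init__(self, P, rng=None):
        self.P = P
        self.rng = rng if rng is not None else np.random.default_rng()

    def sample(self, s, a):
        # One draw from the transition function at state s, action a.
        return self.rng.choice(self.P.shape[-1], p=self.P[s, a])


def sample_budget(S, A, gamma, eps, delta, c=1.0):
    """Total samples suggested by the paper's upper bound, up to the
    unspecified constant c:

        c * S*A / ((1-gamma)^3 * eps^2)
          * log(S*A / ((1-gamma)*delta*eps))
          * log(1 / ((1-gamma)*eps))
    """
    log1 = math.log(S * A / ((1.0 - gamma) * delta * eps))
    log2 = math.log(1.0 / ((1.0 - gamma) * eps))
    return int(c * S * A / ((1.0 - gamma) ** 3 * eps ** 2) * log1 * log2)


# Example: nominal budget for a small DMDP with S=100 states, A=10 actions.
print(sample_budget(S=100, A=10, gamma=0.9, eps=0.1, delta=0.05))
```

Note how the budget scales with $(1-\gamma)^{-3}$: halving the effective-horizon parameter $1-\gamma$ multiplies the (polynomial part of the) sample count by eight, which is why the $(1-\gamma)^{-1}$ improvement over prior bounds is significant for long-horizon problems.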