Offline Policy Evaluation (OPE) aims to evaluate the expected cumulative reward of a target policy $\pi$ when offline data are collected by running a logging policy $\mu$. Standard importance-sampling approaches to this problem suffer from a variance that scales exponentially with the time horizon $H$, which has motivated a surge of recent interest in alternatives that break the ``Curse of Horizon''. In the second chapter of this thesis, we prove that a modification of the \emph{Marginalized Importance Sampling} (MIS) method achieves the Cram\'er-Rao lower bound, provided that the state space and the action space are finite.
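For concreteness, the contrast can be sketched as follows (the notation here is illustrative and not the exact estimator analyzed in the chapter): given $n$ logged trajectories $\{(s^{(i)}_t, a^{(i)}_t, r^{(i)}_t)\}_{t=1}^{H}$, the step-wise importance-sampling estimator and a marginalized counterpart take the forms
\[
\hat{v}^{\pi}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{H}\Bigg(\prod_{t'=1}^{t}\frac{\pi(a^{(i)}_{t'}\mid s^{(i)}_{t'})}{\mu(a^{(i)}_{t'}\mid s^{(i)}_{t'})}\Bigg)\, r^{(i)}_{t},
\qquad
\hat{v}^{\pi}_{\mathrm{MIS}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{H}\frac{\hat{d}^{\pi}_{t}(s^{(i)}_{t})}{\hat{d}^{\mu}_{t}(s^{(i)}_{t})}\cdot\frac{\pi(a^{(i)}_{t}\mid s^{(i)}_{t})}{\mu(a^{(i)}_{t}\mid s^{(i)}_{t})}\, r^{(i)}_{t}.
\]
The cumulative product of up to $H$ likelihood ratios in the former drives the exponential variance, while the latter replaces it with an estimated marginal state ratio $\hat{d}^{\pi}_{t}/\hat{d}^{\mu}_{t}$ and a single per-step action ratio.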
In the third chapter of the thesis, we go beyond the standard off-policy evaluation setting and propose a new uniform convergence framework for OPE. The uniform OPE problem requires evaluating all policies in a policy class $\Pi$ simultaneously (formalized below), and we obtain nearly optimal error bounds for a number of global/local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $O(H^3/(d_m\epsilon^2))$ for identifying an $\epsilon$-optimal policy under the time-inhomogeneous episodic MDP model, where $d_m$ is the minimal marginal state-action visitation probability of the underlying MDP under the logging policy $\mu$. In the fourth chapter, we further improve the sample complexity guarantee to $O(H^2/(d_m\epsilon^2))$ for time-homogeneous episodic MDPs, using a novel singleton-absorbing MDP technique. Both results are known to be optimal in their respective settings. In the final part of the thesis, we summarise our other work in reinforcement learning and conclude with potential future directions.
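In symbols (the notation below is ours, with $\hat{v}^{\pi}$ the estimated value and $v^{\pi}$ the true value of a policy $\pi$), the uniform OPE guarantee of the third and fourth chapters requires
\[
\sup_{\pi\in\Pi}\big|\hat{v}^{\pi}-v^{\pi}\big|\le\epsilon,
\]
which in particular implies near-optimal planning: for $\hat{\pi}\in\arg\max_{\pi\in\Pi}\hat{v}^{\pi}$ and $\pi^{\star}\in\arg\max_{\pi\in\Pi}v^{\pi}$, one has $v^{\pi^{\star}}-v^{\hat{\pi}}\le 2\sup_{\pi\in\Pi}|\hat{v}^{\pi}-v^{\pi}|\le 2\epsilon$.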