Extended UCB Policy for Multi-Armed Bandit with Light-Tailed Reward Distributions
Published Web Location
https://arxiv.org/pdf/1112.1768.pdf

Abstract
We consider the multi-armed bandit problem, in which a player aims to accrue reward by sequentially playing a given set of arms with unknown reward statistics. Classic work proposed policies that achieve the optimal logarithmic regret order for certain special classes of light-tailed reward distributions, e.g., Auer et al.'s UCB1 index policy for reward distributions with finite support. In this paper, we extend Auer et al.'s UCB1 index policy to achieve the optimal logarithmic regret order for all light-tailed (or equivalently, locally sub-Gaussian) reward distributions, defined by the (local) existence of the moment-generating function.
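For context, below is a minimal sketch of the classic UCB1 index policy that the paper builds on: play each arm once, then always play the arm with the highest sample-mean-plus-confidence index. The exact modified index of the extended policy is not reproduced here; the comment inside the loop only gestures at where the paper's change would apply, and the Gaussian arms in the usage example are an illustrative assumption (any light-tailed reward distribution would do).

```python
import math
import random


def ucb1(pull, num_arms, horizon):
    """Classic UCB1 (Auer et al.): initialize by playing each arm once,
    then play the arm maximizing sample mean + sqrt(2 ln t / n_i)."""
    counts = [0] * num_arms   # times each arm has been played
    means = [0.0] * num_arms  # running sample mean of each arm's reward

    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # initialization round: play each arm once
        else:
            # UCB1 index. The paper's extension keeps this index form but
            # adjusts the confidence term so the logarithmic regret order
            # holds for all light-tailed reward distributions.
            arm = max(
                range(num_arms),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean

    return counts, means


# Usage example: three Gaussian (light-tailed) arms with means 0.4, 0.6, 0.5.
arms = [(0.4, 1.0), (0.6, 1.0), (0.5, 1.0)]
counts, means = ucb1(lambda i: random.gauss(*arms[i]), len(arms), 10_000)
print(counts, [round(m, 3) for m in means])  # best arm dominates the counts
```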