Minimax Optimality in Online Learning under Logarithmic Loss with Parametric Constant Experts
- Author(s): Hedayati, Fares
- Advisor(s): Bartlett, Peter L
- et al.
We study online prediction of individual sequences under logarithmic loss with parametric experts. The goal is to predict a sequence of outcomes, revealed one at a time, almost as well as a set of experts. At round t, the forecaster's prediction takes the form of a conditional probability density. The loss that the forecaster suffers at that round is the negative log of the conditional probability of the outcome revealed after the forecaster's prediction. The performance of the prediction strategy is measured relative to the best in a reference set of experts, a parametric class of i.i.d distributions. The difference between the accumulated loss of the prediction strategy and the best expert in the reference set is called the regret. We focus on the minimax regret, which is the regret of the strategy with the minimum of the worst-case regret over outcome sequences.
The minimax regret is achieved by the normalized maximum likelihood (NML) strategy. This strategy knows the length of the sequence in advance and the probability it assigns to each sequence is proportional to the maximum likelihood of the sequence. Conditionals are computed at each round by marginalization which is very costly for NML. Due to this drawback, much focus has been given to alternative strategies such as sequential normalized maximum likelihood (SNML) and Bayesian strategies. The conditional probability that SNML assigns to the next outcome is proportional to the maximum likelihood of the data seen so far and the next outcome.
We investigate conditions that lead to optimality of SNML and Bayesian strategies. A major part of this thesis is dedicated to showing that optimality of SNML and optimality of a certain Bayesian strategy, namely the Bayesian strategy under Jeffreys prior are equivalent to each other, i.e. if SNML is optimal, then so is the Bayesian strategy under Jeffreys prior and if the Bayesian strategy under the Jeffreys prior is optimal then so is SNML. Note that Jeffreys prior in parametric families is proportional to the square root of the determinant of the Fisher information. Furthermore we show that optimality of SNML happens if and only if the joint distribution on sequences defined by SNML is exchangeable, i.e. the probability that SNML assigns to any sequence is invariant under any permutation of the sequence. These results are proven for exponential families and any parametric family for which the maximum likelihood estimator is asymptotically normal. The most important implication of these results is that when SNML-exchangeability holds NML becomes horizon-independent, and it could be either calculated through a Bayesian update with Jeffreys prior or through a one step-ahead maximum likelihood calculation as in SNML. Another major part of this thesis is focused on showing that SNML-exchangeabilty holds for a large class of one-dimensional exponential family distributions, namely for Gaussian, the gamma, and the Tweedie exponential family of order 3/2, and any one-to-one transformation of them and that it cannot hold for other one-dimensional exponential family distributions.
Finally in this thesis we investigate horizon-dependent priors when Jeffreys prior is not optimal. Only Jeffreys prior can make a Bayesian strategy optimal. This means that if Jeffreys prior is not optimal then nor is any other prior, except for possibly a horizon-dependent prior. This is because if there does not exist a prior that can make the Bayesian strategy optimal for all horizons then the only possibilities are priors that depend on the horizon of the game. We investigate the behavior of a natural horizon-dependent prior called the NML prior. We show that the NML prior converges in distribution to Jeffreys prior, which makes it asymptotically optimal, but not necessarily optimal for an arbitrary horizon. Furthermore we show that there are exactly three families, namely Gaussian, gamma and inverse Gaussian, where the NML prior is equal to Jeffreys prior and hence horizon-independent. Two of these families namely gamma and Gaussian have optimal NML prior. We also investigate the problem of finding an optimal horizon-dependent prior for online binary prediction with Bernoulli experts. We could not solve this problem, but we describe insights gained from our investigation and possible directions that researchers can take in tackling this open problem.