## Efficient Sequential Decision Making

- Author(s): Malek, Alan
- Advisor(s): Bartlett, Peter L
- et al.

## Abstract

This thesis studies three problems in sequential decision making

across two different frameworks. The first framework we consider is

online learning: for each round of a $T$ round repeated game, the

learner makes a prediction, the adversary observes this prediction

and reveals the true outcome, and the learner suffers some loss

based on the accuracy of the prediction. The learner's aim is to

minimize the regret, which is defined to be the difference between

the learner's cumulative loss and the cumulative loss of the best

prediction strategy in some class. We study the minimax strategy,

which guarantees the lowest regret against all possible adversary

strategies. In general, computing the minimax strategy is

exponential in $T$; we focus on two setting where efficient

algorithms are possible.

The first is prediction under squared Euclidean loss. The learner

predicts a point in $\Reals^d$ and the adversary is constrained to

respond with a point in some compact set. The regret is with respect

to the single best prediction in the set. We compute the minimax

strategy and the value of the game for any compact set and show that

the value is the product of a horizon-dependent constant and the

squared radius of the smallest enclosing ball of the set. We also

present the optimal strategy of the adversary for two important

sets: ellipsoids and polytopes that intersect their smallest

enclosing ball at all vertices. The minimax strategy can be cast as

a simple shrinkage of the past data towards the center of this

minimum enclosing ball, where the shrinkage factor can be

efficiently computed before the start of the game. Noting that the

value does not have any explicit dimension dependence, we then

extend these results to Hilbert space, finding, once again, that the

value is proportional to the squared radius of the smallest

enclosing ball.

The second setting where we derive efficient minimax strategies is

online linear regression. At the start of each round, the adversary

chooses and reveals a vector of covariates. The

regret is defined with respect to the best linear function of the

covariates. We show that the minimax strategy is an easily computed

linear predictor, provided that the adversary adheres to some

natural constraints that prevent him from misrepresenting

the scale of the problem. This strategy is horizon-independent:

regardless of the length of the game, this strategy incurs no more

regret than any strategy that has knowledge of the number of

rounds. We also provide an interpretation of the minimax algorithm

as a follow-the-regularized-leader strategy with a data-dependent

regularizer and obtain an explicit expression for the minimax

regret.

We then turn to the second framework, reinforcement learning. More

specifically, we consider the problem of controlling a Markov

decision process (MDP) with a large state-space. Since it is

intractable to compete with the optimal policy for large scale

problems, we pursue the more modest goal of competing with a

low-dimensional family of policies. Specifically, we restrict the

variables of the dual linear program to lie in some low-dimensional

subspace, and show that we can find a policy that performs almost as

well as the best policy in this class. We derive separate results

for the average cost and discounted cost cases. Most importantly,

the complexity of our method depends on the size of the comparison

class but not the size of the state-space. Preliminary experiments

show the effectiveness of the proposed algorithms in a queuing

application.