This thesis studies three problems in sequential decision making
across two different frameworks. The first framework we consider is
online learning: for each round of a $T$-round repeated game, the
learner makes a prediction, the adversary observes this prediction
and reveals the true outcome, and the learner suffers some loss
based on the accuracy of the prediction. The learner's aim is to
minimize the regret, which is defined to be the difference between
the learner's cumulative loss and the cumulative loss of the best
prediction strategy in some class. We study the minimax strategy,
which guarantees the lowest regret against all possible adversary
strategies. In general, computing the minimax strategy takes time
exponential in $T$; we focus on two settings where efficient
algorithms are possible.
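As a rough formalization in generic notation (the symbols here are
illustrative and not necessarily those used in the body of the
thesis): writing $\hat{y}_t$ for the learner's prediction, $y_t$ for
the adversary's outcome, $\ell$ for the loss, and $f_t$ for the
round-$t$ prediction of a comparator strategy $f$ in the class
$\mathcal{F}$, the regret is
\[
  R_T \;=\; \sum_{t=1}^{T} \ell(\hat{y}_t, y_t)
  \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t),
\]
and the minimax value of the game alternates the learner's infimum
with the adversary's supremum over all $T$ rounds,
$\inf_{\hat{y}_1} \sup_{y_1} \cdots \inf_{\hat{y}_T} \sup_{y_T} R_T$.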
The first is prediction under squared Euclidean loss. The learner
predicts a point in $\Reals^d$ and the adversary is constrained to
respond with a point in some compact set. The regret is with respect
to the single best prediction in the set. We compute the minimax
strategy and the value of the game for any compact set and show that
the value is the product of a horizon-dependent constant and the
squared radius of the smallest enclosing ball of the set. We also
present the optimal strategy of the adversary for two important
sets: ellipsoids and polytopes that intersect their smallest
enclosing ball at all vertices. The minimax strategy can be cast as
a simple shrinkage of the past data towards the center of this
smallest enclosing ball, where the shrinkage factor can be
computed efficiently before the start of the game. Noting that the
value does not have any explicit dimension dependence, we then
extend these results to Hilbert space, finding, once again, that the
value is proportional to the squared radius of the smallest
enclosing ball.
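Schematically, in illustrative notation: if $c$ and $R$ denote the
center and radius of the smallest enclosing ball of the adversary's
set and $x_1, \dots, x_t$ are the points revealed so far, the value
has the form $V_T = \alpha_T R^2$ for a horizon-dependent constant
$\alpha_T$, and one way to write a shrinkage strategy of the kind
described above is
\[
  \hat{a}_{t+1} \;=\; c + \lambda_t \sum_{s=1}^{t} (x_s - c),
\]
where the shrinkage coefficients $\lambda_t$ can be computed before
the start of the game (this display is a schematic form of the
strategy rather than its exact statement).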
The second setting where we derive efficient minimax strategies is
online linear regression. At the start of each round, the adversary
chooses and reveals a vector of covariates. The
regret is defined with respect to the best linear function of the
covariates. We show that the minimax strategy is an easily computed
linear predictor, provided that the adversary adheres to some
natural constraints that prevent it from misrepresenting
the scale of the problem. This strategy is horizon-independent:
regardless of the length of the game, it incurs no more regret
than any strategy that knows the number of rounds in advance.
We also provide an interpretation of the minimax algorithm
as a follow-the-regularized-leader strategy with a data-dependent
regularizer and obtain an explicit expression for the minimax
regret.
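Concretely, if the adversary reveals covariates $x_t$ and, after the
learner's prediction $\hat{y}_t$, a label $y_t$ (generic symbols for
this summary), and we take the squared error as the loss, as is
standard for regression, then the regret takes the form
\[
  R_T \;=\; \sum_{t=1}^{T} (\hat{y}_t - y_t)^2
  \;-\; \min_{\theta} \sum_{t=1}^{T} (\theta^\top x_t - y_t)^2,
\]
and the minimax strategy predicts with an easily computed linear
function of the observations.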
We then turn to the second framework, reinforcement learning. More
specifically, we consider the problem of controlling a Markov
decision process (MDP) with a large state space. Since it is
intractable to compete with the optimal policy for large-scale
problems, we pursue the more modest goal of competing with a
low-dimensional family of policies. Specifically, we restrict the
variables of the dual linear program to lie in some low-dimensional
subspace, and show that we can find a policy that performs almost as
well as the best policy in this class. We derive separate results
for the average-cost and discounted-cost cases. Most importantly,
the complexity of our method depends on the size of the comparison
class but not on the size of the state space. Preliminary experiments
show the effectiveness of the proposed algorithms in a queuing
application.
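As a schematic illustration of this reduction (the symbols $\mu$,
$\Phi$, $\theta$, and $k$ are generic, chosen for this summary): the
dual variables can be read as a stationary state-action distribution
$\mu$, and restricting them to a $k$-dimensional subspace amounts to
writing
\[
  \mu_\theta \;=\; \Phi\theta, \qquad \theta \in \Reals^{k},
\]
for a fixed feature matrix $\Phi$ with $k$ columns, and then
optimizing the dual objective over $\theta$ subject to the original
nonnegativity, normalization, and stationarity constraints; the
number of optimization variables is then $k$ rather than the number
of state-action pairs.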