The multi-armed bandit (MAB) problem is a basic setting for sequential decision-making with partial information. To obtain a good algorithm for MAB, we must balance the trade-off between
exploitation and exploration, i.e., whether to take the action that has worked well so far or to take
other actions to gain more information.
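The exploitation-exploration trade-off can be illustrated by the classical epsilon-greedy strategy (a standard baseline, not the algorithm studied in this work): with a small probability the learner explores a random arm, and otherwise it exploits the empirically best arm. A minimal sketch, assuming Bernoulli rewards and hypothetical arm means:

```python
import random

def epsilon_greedy(means, T=10000, eps=0.1, seed=0):
    """Epsilon-greedy sketch: explore with probability eps, else exploit.

    means: hypothetical Bernoulli success probabilities, one per arm.
    Returns the average reward collected over T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # pulls per arm
    est = [0.0] * K         # empirical mean reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(K)                      # explore
        else:
            arm = max(range(K), key=lambda a: est[a])   # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]   # incremental mean
        total += reward
    return total / T

avg = epsilon_greedy([0.2, 0.5, 0.8])
```

With the best arm paying 0.8 on average and 10% of rounds spent exploring uniformly, the average reward should land well above the uniform-play baseline of 0.5 but below 0.8.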
Differential privacy (DP) is a common notion for protecting the privacy of individuals: it
ensures that the output distribution of an algorithm does not change much if we modify the
data of a single individual. Therefore, an adversary cannot identify an individual by looking at
the outputs of a DP algorithm. In other words, the outputs are "stable" under perturbations of the inputs. This
implies that DP learning algorithms are slow learners, but they are more robust than their non-DP
counterparts. We can use this property to balance the trade-off in MAB problems.
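For concreteness, the standard definition of $\varepsilon$-differential privacy (the usual formulation from the DP literature, not a definition specific to this work) reads:

```latex
A randomized algorithm $\mathcal{M}$ is $\varepsilon$-differentially private if,
for all neighboring datasets $D, D'$ differing in the data of one individual and
every measurable set of outputs $S$,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S].
\]
```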
In this work, we define a notion of stability motivated by DP, called Distributional Stability,
for randomized MAB algorithms, and we study this stability in the stochastic MAB setting. We
prove that if a randomized MAB algorithm satisfies this stability property and its output satisfies
a certain accuracy guarantee, then it attains a regret bound polynomial in log(T).
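For reference, the quantity being bounded is the standard (pseudo-)regret of stochastic MAB (the usual definition; the precise bound proved in this work is stated later):

```latex
\[
  R(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\]
where $\mu^{*}$ is the mean reward of the best arm and $a_t$ is the arm pulled
at round $t$; a bound polynomial in $\log(T)$ means $R(T) = O(\mathrm{polylog}(T))$.
```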