The data explosion and the development of artificial intelligence (AI) have fueled the demand for recommendation systems, information retrieval, personalization, and other applications. Consequently, the need for solutions that optimize these systems “on-the-fly” has also grown rapidly. The contextual bandit is a machine learning framework designed to tackle such complex situations in an online manner: the agent selects actions (i.e., arms) based on the available context information. Based on the feedback it receives, the agent learns the relation between the context information and the reward of each arm, which in turn improves arm selection in the future. In practice, however, the learning environment may be far from perfect. For example, the available context information may be inaccurate, the reward feedback may be delayed or even missing, and data may not be centrally available due to user privacy concerns.
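As a concrete illustration of this interaction loop, the sketch below implements a simple linear contextual bandit in the LinUCB style; the class names, dimensions, and noise model are illustrative assumptions for exposition only and are not the algorithms proposed in this dissertation.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression model of reward as a linear function of context."""
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha          # exploration weight
        self.A = np.eye(dim)        # regularized Gram matrix of observed contexts
        self.b = np.zeros(dim)      # accumulated reward-weighted contexts

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b      # current estimate of the reward model
        # optimistic score: predicted reward plus a confidence width
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def run_bandit(arms, contexts, reward_fn):
    """One round per context: pick the arm with the highest UCB, observe reward, learn."""
    total = 0.0
    for x in contexts:
        chosen = int(np.argmax([arm.ucb(x) for arm in arms]))
        r = reward_fn(chosen, x)    # feedback from the (here simulated) environment
        arms[chosen].update(x, r)
        total += r
    return total

# Hypothetical usage with a synthetic linear environment.
rng = np.random.default_rng(0)
true_theta = [rng.normal(size=5) for _ in range(3)]   # hidden reward models
arms = [LinUCBArm(dim=5) for _ in range(3)]
contexts = [rng.normal(size=5) for _ in range(200)]
total_reward = run_bandit(arms, contexts,
                          lambda a, x: true_theta[a] @ x + rng.normal(scale=0.1))
```

The imperfections studied in this dissertation break the assumptions of such a textbook loop: the context vector `x` may be known only probabilistically, the reward `r` may arrive late or not at all, selected arms may be removed, and the data may be distributed across agents.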
In this dissertation, we consider the practical scenario of contextual bandits in an imperfect environment. First, we focus on imperfect context and study learning with probabilistic contexts, where, instead of the true context, a bundle of candidate contexts is revealed to the agent along with their corresponding probabilities. Second, we study imperfect rewards by considering delayed or missing reward feedback. Third, we turn to an adversarial environment and study a novel combinatorial setting with arm removal and submodular utility, in which some of the selected arms can be removed adversarially. Finally, we consider a privacy-preserving federated bandit setting, where a group of agents cooperates to solve the bandit problem while ensuring that their communication remains private. For each of these settings, we propose new learning algorithms, analyze the cumulative regret, and conduct empirical evaluations based on real-world applications.
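For reference, cumulative regret in contextual bandits is typically defined as below; the notation here is generic and assumed for exposition rather than taken from a specific chapter. Over a horizon of $T$ rounds, regret compares the expected reward of the chosen arm $a_t$ with that of the best arm for the observed context $x_t$:

```latex
R(T) \;=\; \sum_{t=1}^{T}
  \Big( \mathbb{E}\big[r_t(a_t^{*}, x_t)\big] - \mathbb{E}\big[r_t(a_t, x_t)\big] \Big),
\qquad
a_t^{*} \;=\; \arg\max_{a} \, \mathbb{E}\big[r_t(a, x_t)\big].
```

An algorithm whose regret grows sublinearly in $T$ eventually selects near-optimal arms on average, which is the benchmark used when analyzing the algorithms proposed for each setting above.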