Cameras naturally capture sequences of images, or videos, and for computers to understand videos, they must track objects to connect the past with the present. We focus on two problems that challenge current state-of-the-art trackers. First, we address the challenge of long-term occlusion. For this challenge, a tracker must know when it has lost track and how to reinitialize tracking when the target reappears. We tackle reinitialization by building good appearance models for humans and hands, with a particular emphasis on robustness to occlusion. For the second challenge, appearance variation, the tracker must know when and how to re-learn (or update) an appearance model. Common solutions to this challenge encounter the classic problem of drift: aggressively learning putative appearance changes allows small errors to compound, as elements of the background environment pollute the appearance model. We propose two solutions. First, we consider self-paced learning, wherein a tracker begins by learning from frames it finds easy; as the tracker becomes better at recognizing the target, it begins to learn from harder frames. Second, we develop a data-driven approach in which we train a tracking policy to decide when and how to update an appearance model. To take this direct approach to “learning when to learn”, we exploit large-scale Internet data through reinforcement learning. We interpret the resulting policy and conclude with extensions for tracking multiple objects. By solving these tracking challenges, we advance applications in augmented reality, vehicle automation, healthcare, and security.
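To make the self-paced mechanism concrete, here is a minimal sketch under strong simplifying assumptions: a linear appearance model trained with a hinge loss stands in for the tracker, and synthetic feature vectors stand in for video frames (none of these choices come from the work summarized above). Frames whose current loss falls below a threshold count as "easy"; raising the threshold across passes admits progressively harder frames.

```python
# Minimal, self-contained sketch of self-paced appearance-model updating.
# Illustrative only: the linear model, hinge loss, thresholds, and toy data
# are assumptions, not the tracker developed in this work.
import numpy as np

def confidence(w, x):
    """Signed score of a linear appearance model; |score| reflects ease."""
    return float(w @ x)

def self_paced_pass(w, frames, threshold, lr=0.1):
    """One pass that learns only from frames the model currently finds easy.

    A frame is "easy" when its hinge loss falls below `threshold`; the caller
    raises `threshold` across passes, gradually admitting harder frames.
    """
    for x, y in frames:
        loss = max(0.0, 1.0 - y * confidence(w, x))
        if 0.0 < loss < threshold:      # easy but still informative: learn
            w = w + lr * y * x          # hinge-loss subgradient step
    return w

rng = np.random.default_rng(0)
# Toy "frames": feature vectors for the target (+1) and background clutter (-1).
frames = [(rng.normal(+1.0, 1.0, 8), +1) for _ in range(50)] \
       + [(rng.normal(-1.0, 1.0, 8), -1) for _ in range(50)]

x0, y0 = frames[0]
w = 0.1 * y0 * x0                       # initialize from the first annotated frame
for threshold in (0.5, 1.0, 2.0):       # the self-paced schedule: easy -> hard
    w = self_paced_pass(w, frames, threshold)

accuracy = np.mean([np.sign(confidence(w, x)) == y for x, y in frames])
print(f"toy accuracy after self-paced passes: {accuracy:.2f}")
```

The point the sketch isolates is the schedule: the model never trains on frames it cannot yet score confidently, which is what limits the drift described above, since background clutter is kept out of the appearance model until the model is strong enough to reject it.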