For an intelligent agent to interact with its environment efficiently, it must be able to predict, plan, and generalize. This thesis studies how an intelligent agent can learn to predict future observations and leverage the resulting predictive models for efficient policy learning and generalization. It presents four instances of this idea: high-fidelity video prediction, video prediction that handles multi-modal data distributions, predictive model-based reinforcement learning, and model-based zero-shot policy generalization. In the first instance, we use a model that disentangles motion and appearance to predict high-fidelity images. We find that this method alleviates the blurring artifacts and shape deformation inherent in previous methods. In the second instance, we propose an example-guided model to address the multi-modal distribution of real-world data. The proposed method produces diverse, multi-modal predictions and generalizes well. In the third instance, we propose a model-based reinforcement learning method with theoretical guarantees. Specifically, we propose a novel value discrepancy loss for training the predictive model. We also show experimentally that this framework and loss significantly improve sample efficiency. Finally, we propose a method that learns both the dynamics model and the value of regions for zero-shot policy generalization. We show that this approach generalizes to novel tasks without fine-tuning. Together, these methods are steps toward learning and using better predictive models to learn policies efficiently.