Multi-frame Video Prediction with Learnable Temporal Motion Encodings
Although recent deep learning methods have made significant progress on video prediction, most predict only the next frame or a fixed number of future frames. To obtain longer-term predictions, existing techniques usually process their own predicted frames iteratively, which results in blurry or inconsistent predictions. In this thesis, we present a new approach that predicts an arbitrary number of future video frames in a single forward pass through the network. Instead of directly predicting a fixed number of future optical flows or frames, we learn temporal motion encodings, i.e., a set of temporal motion basis vectors together with a network that predicts their coefficients. The learned motion basis can be easily extended to arbitrary length at inference time, which enables prediction of an arbitrary number of future frames. Experiments on benchmark datasets indicate that our approach performs favorably against state-of-the-art techniques even in the next-frame prediction setting. Under the 5-frame and 10-frame prediction settings, the proposed method obtains larger performance gains over existing state-of-the-art techniques that process their predictions iteratively.
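To make the idea of a temporal motion basis with predicted coefficients concrete, the following is a minimal sketch and not the implementation used in this thesis. It assumes that the flow at each future step is a linear combination of basis values weighted by per-pixel coefficients, and it substitutes a fixed polynomial basis for the learned one purely for illustration. The names `temporal_basis` and `decode_flows`, the array shapes, and the random coefficients standing in for a network output are all hypothetical.

```python
import numpy as np


def temporal_basis(num_steps, num_basis):
    """Hypothetical stand-in for a learned basis: t**1 .. t**K.

    Returns an array of shape (num_steps, num_basis). Because the basis is
    a function of the time index, it can be evaluated for any horizon.
    """
    t = np.arange(1, num_steps + 1, dtype=np.float64)
    return np.stack([t ** k for k in range(1, num_basis + 1)], axis=1)


def decode_flows(coeffs, num_steps):
    """Combine per-pixel coefficients with the temporal basis.

    coeffs: (K, H, W, 2) array, assumed here to come from a single forward
            pass of a coefficient-prediction network.
    Returns flows of shape (num_steps, H, W, 2).
    """
    basis = temporal_basis(num_steps, coeffs.shape[0])
    # flows[t] = sum over k of basis[t, k] * coeffs[k]
    return np.einsum("tk,khwc->thwc", basis, coeffs)


# Random coefficients stand in for the network output (assumption).
rng = np.random.default_rng(0)
coeffs = rng.normal(size=(3, 4, 5, 2))

flows_5 = decode_flows(coeffs, 5)    # 5-frame horizon
flows_10 = decode_flows(coeffs, 10)  # 10-frame horizon, same coefficients
```

In this sketch the coefficients are computed once and only the basis is re-evaluated for a longer horizon, so the first five flows of the 10-step result coincide with the 5-step result. How the learned basis is actually parameterized and extended is not specified in this abstract.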