Near-future prediction in videos has a crucial impact on a wide range of practical applications that require an anticipatory response. In videos, prediction can be performed in different spaces such as labels, captions, and frames. Labels can be predicted over a longer future horizon but are less informative than frames. Video frames are much richer in content than labels, but only a few frames can be predicted ahead. Captions lie between these two extremes: they can describe changes in activities over a longer prediction horizon while providing a much richer description than labels. In this thesis, we present three distinct prediction frameworks built upon different computer vision and machine learning techniques. However, these methods require large amounts of labeled data, which is expensive to obtain due to the high annotation cost. We therefore also propose a novel early prediction framework that makes video annotation scalable.
Most existing works on labeling human activities focus on the recognition or early recognition problem, where complete or partial observations of the activity are available. In the prediction problem we address, however, no observation of the future activity is available beforehand. We propose a system that infers the labels and starting times of a sequence of future, unobserved activities by combining different context attributes from the observed portion of the video. Next, we propose a sequence-to-sequence learning approach using an encoder-decoder LSTM pair for captioning near-future unobserved activity sequences.
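To make the sequence-to-sequence formulation concrete, the following is a minimal sketch of an encoder-decoder LSTM captioner: the encoder summarizes features of the observed video segment, and the decoder generates tokens describing the unobserved future. The feature dimension, vocabulary size, and teacher-forcing setup are illustrative assumptions, not the exact configuration used in the thesis.

```python
# Minimal encoder-decoder LSTM sketch for future-activity captioning (illustrative only).
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        # Encoder LSTM summarizes per-frame features of the observed segment.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder LSTM emits caption tokens for the unobserved future segment.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, observed_feats, caption_tokens):
        # observed_feats: (batch, T_obs, feat_dim); caption_tokens: (batch, T_cap)
        _, (h, c) = self.encoder(observed_feats)      # encode the observed frames
        dec_in = self.embed(caption_tokens)           # teacher forcing at training time
        dec_out, _ = self.decoder(dec_in, (h, c))     # decode conditioned on the encoder state
        return self.out(dec_out)                      # (batch, T_cap, vocab_size) logits
```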
Building upon the prediction framework, we also address the frame reconstruction problem in a multi-camera scenario. When a camera has multiple missing frames and the available frames within that camera are far apart, the corresponding frames from other overlapping cameras become crucial for reconstruction. We propose an adversarial approach using a conditional Generative Adversarial Network (cGAN) in which the conditional input is formed by merging, via a weighted average, the preceding or following frames from the same camera and the corresponding frames from other cameras. We also propose an adversarial learning solution to the multi-modal frame reconstruction problem, in which we learn a mapping between 3D LIDAR point clouds and RGB images. This enables faster processing, since fusion-based approaches, which try to combine the advantages of both data sources, consume substantial computing resources.
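The sketch below illustrates how such a conditional input can be assembled as a weighted average of candidate frames and fed to a generator. The fixed weights and the toy convolutional generator are assumptions for illustration; they are not the exact architecture or weighting scheme used in the thesis.

```python
# Illustrative sketch: weighted-average conditional input for a cGAN generator.
import torch
import torch.nn as nn

def merge_condition(frames, weights):
    # frames: list of (batch, 3, H, W) tensors -- preceding/following frames from the same
    # camera and corresponding frames from overlapping cameras.
    # weights: list of scalars summing to 1 (fixed here; they could also be learned).
    w = torch.tensor(weights).view(-1, 1, 1, 1, 1)
    return (torch.stack(frames) * w).sum(dim=0)       # weighted-average condition

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),  # reconstructed missing frame
        )

    def forward(self, condition):
        return self.net(condition)
```

At training time, a discriminator would judge whether the generated frame is real or reconstructed given the same condition, following the standard cGAN objective.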
We also consider the video annotation problem, as it is crucial for the machine learning approaches described above. State-of-the-art video annotation approaches assume that looking up the correct label category incurs no latency and that the annotator watches the whole video segment. However, choosing the correct label from thousands of categories is not instantaneous, and the long viewing time adds to the annotation cost. We propose an LSTM-based early prediction framework that can be combined with any existing active learning approach to provide the annotator with a list of early label suggestions. This reduces annotation time and cost by a significant margin.
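A minimal sketch of the early-suggestion idea is given below: an LSTM consumes only the features of an observed prefix of a segment and returns the top-k candidate labels for the annotator. The feature dimension, number of classes, and use of the last hidden state are illustrative assumptions rather than the exact design in the thesis.

```python
# Illustrative sketch: LSTM over a partial segment producing top-k label suggestions.
import torch
import torch.nn as nn

class EarlySuggester(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.cls = nn.Linear(hidden_dim, num_classes)

    def suggest(self, prefix_feats, k=5):
        # prefix_feats: (1, T_prefix, feat_dim) features of the observed prefix only.
        out, _ = self.lstm(prefix_feats)
        logits = self.cls(out[:, -1])                  # predict from the last hidden state
        return torch.topk(logits, k, dim=-1).indices   # top-k label suggestions for the annotator
```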