Action Recognition from Videos using Deep Neural Networks
Convolutional neural network(CNN) models have been extensively used in recent years to solve the problem of image understanding giving state-of-the-art results in tasks like classification, recognition, retrieval, segmentation and object detection. Motivated by this success there have been several attempts to extend convolutional neural networks for video understanding and classification. An important distinction between images and videos is the temporal information that is encoded by the sequence of frames. Most CNN models fail to capture this temporal information. Recurrent neural networks have shown promising results in modelling sequences.
In this work we present a neural network model which combines convolutional neural networks and recurrent neural networks. We first evaluate the effect of the convolutional network used for understanding static frames on action recognition. Following this we explore properties that are inherent in the dataset. We combine the representation we get from the convolutional network, the temporal information we get from the sequence of video frames and other properties of the dataset to create a unified model which is trained on the UCF-101 dataset for action recognition. We evaluate our model on the pre-defined test set splits of the UCF-101 dataset. We show that our model is able to achieve an improvement over the baseline model. We show comparison between our models and various models proposed in other related works on the UCF-101 dataset. We observe that a good model for action recognition not only needs to understand static frames but also needs to encode the temporal information across a sequence of frames.