Mining Spatial and Spatio-Temporal ROIs for Action Recognition
In this paper, we propose an approach to classify action sequences. We observe that in action sequences the critical features for discriminating between actions occur only within sub-regions of the image. Hence deep network approaches will address the entire image are at a disadvantage. This motivates our strategy which uses static and spatio-temporal visual cues to isolate static and spatio-temporal regions of interest (ROIs). We then use weakly supervised learning to train deep network classifiers using the ROIs as input. More specifically, we combine multiple instance learning (MIL) with convolutional neural networks (CNNs) to select discriminative action cues. This yields classifiers for static images, using the static ROIs, as well as classifiers for short image sequences (16 frames), using spatio-temporal ROIs. Extensive experiments performed on the UCF101 and HMDB51 benchmarks show that both these types of classifiers perform well individually and achieve state of the art performance when combined together. We also show qualitatively that our ROIs (selected by the algorithms) capture the most relevant parts of the image sequences.