We generate massive amounts of video data every day. While most real-world videos are long and untrimmed, with sparsely localized segments of interest, most existing AI systems for video understanding rely on static image analysis or can only process temporal information in short video snippets. To automatically understand the content of long video streams, this thesis describes our efforts to design accurate, efficient, and intelligent deep learning algorithms for temporal activity detection in untrimmed videos.
Detecting segments of interest from untrimmed videos is a key step towards segment-level video understanding. Depending on the purpose of the task being performed, we address three different activity detection tasks: detecting activities of interest from videos without a specific purpose (i.e., temporal activity detection); detecting the temporal segment that best corresponds to a language query (i.e., natural language moment retrieval); and detecting activities given limited supervision (i.e., weakly-supervised or few-shot activity detection).
In temporal activity detection, we first propose a highly unified single-shot temporal activity detector based on fully 3D convolutional networks, eliminating the explicit temporal proposal and classification stages. Evaluations show that it achieves state-of-the-art performance on temporal activity detection while being highly efficient, operating at 1271 FPS. We then investigate how to effectively apply a multi-scale architecture to model activities with varying temporal lengths and frequencies. We propose three novel architecture designs: (1) dynamic temporal sampling; (2) a two-branch feature hierarchy; and (3) multi-scale contextual feature fusion. We combine all these components into a unified network and achieve the state of the art on a much larger temporal activity detection benchmark.
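To make the single-shot idea concrete, the following is a toy NumPy sketch, not the thesis model: shared prediction heads score every temporal anchor and regress its boundaries in one pass, with no separate proposal stage. All names, shapes, and the center/length anchor parameterization are illustrative assumptions.

```python
import numpy as np

def single_shot_detect(features, w_cls, w_reg, anchors, score_thresh=0.5):
    """Toy single-shot temporal detector (illustrative only): shared linear
    heads predict class scores and boundary offsets for every temporal
    anchor at once, with no explicit proposal stage."""
    # features: (T, D) temporal feature map, assumed to come from a 3D-conv backbone
    logits = features @ w_cls                 # (T, num_classes) class scores per step
    offsets = features @ w_reg                # (T, 2) center/length offsets per step
    probs = 1.0 / (1.0 + np.exp(-logits))     # per-class probabilities
    detections = []
    for t, (center, length) in enumerate(anchors):
        c = center + offsets[t, 0] * length   # refine the anchor center
        l = length * np.exp(offsets[t, 1])    # refine the anchor length
        cls = int(np.argmax(probs[t]))
        if probs[t, cls] > score_thresh:      # keep confident anchors as detections
            detections.append((c - l / 2, c + l / 2, cls, float(probs[t, cls])))
    return detections
```

A real detector would use learned 3D convolutions and non-maximum suppression; the sketch only shows why a single forward pass suffices once proposals and classification share the same dense anchor grid.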
In natural language moment retrieval, we aim to localize the segment that best corresponds to a given language query. We present a language-guided temporal attention module and an iterative graph adjustment network to handle the semantic and structural misalignment between video and language. The proposed model demonstrates superior capability in handling temporal relations, and thus improves over the state of the art by a large margin.
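The core of language-guided temporal attention can be illustrated with a toy NumPy sketch, under the assumption of a single sentence embedding scoring each timestep; the thesis module is more elaborate, and all names and shapes here are illustrative.

```python
import numpy as np

def language_guided_attention(video_feats, query_vec):
    """Toy sketch of language-guided temporal attention: the sentence
    embedding scores each video timestep, and a softmax over time pools
    the features into a query-conditioned summary."""
    # video_feats: (T, D) per-snippet features; query_vec: (D,) sentence embedding
    scores = video_feats @ query_vec                   # (T,) relevance per timestep
    scores = scores - scores.max()                     # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over time
    attended = weights @ video_feats                   # (D,) query-guided video summary
    return weights, attended
```

The attention weights themselves already give a coarse localization signal: timesteps that match the query receive most of the probability mass.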
Finally, we study the problem of weakly-supervised and few-shot temporal activity detection, to reduce the huge amount of supervision needed to train a temporal detection model. Namely, we ask whether a temporal activity detector can be learned under weak supervision that is able to localize unseen activity classes. We accordingly propose a novel meta-learning based detection method that adopts the few-shot learning technique of Relation Network. Results show that our method achieves performance superior or comparable to that of state-of-the-art approaches trained with stronger supervision.
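The Relation Network idea underlying the few-shot component can be sketched in a few lines of NumPy; this is a toy comparator, not the thesis detector, and the weights and shapes are illustrative assumptions.

```python
import numpy as np

def relation_score(support_feat, query_feat, w1, w2):
    """Toy Relation Network-style comparator: concatenate a support-class
    embedding with a query embedding and score their match with a tiny
    two-layer MLP, so a new activity class can be recognized from only a
    few labeled examples."""
    pair = np.concatenate([support_feat, query_feat])  # joint (2D,) pair embedding
    hidden = np.maximum(0.0, pair @ w1)                # ReLU hidden layer
    logit = hidden @ w2                                # scalar relation logit
    return 1.0 / (1.0 + np.exp(-logit))                # match probability in (0, 1)
```

Sliding such a comparator over video snippets turns few-shot recognition into few-shot detection: snippets whose relation score to a support example is high are grouped into candidate segments.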
In summary, we propose a suite of algorithms and solutions to automatically detect segments of interest in long untrimmed videos. We hope our studies provide insights for researchers exploring new deep learning paradigms in future computer vision research, especially on video-related topics.