Visual recognition and natural language understanding are two central challenges in artificial intelligence. In recent years, there has been growing interest in jointly addressing these problems, with applications such as visual question answering, referring expression comprehension, and image captioning. Since language is a fundamental tool for communication, these applications provide intuitive ways for humans to interact with intelligent systems. In this dissertation, we design algorithms for grounding visual content through natural language across tasks of varying granularity. Our work spans both static images and dynamic videos, covering 1) referring expression comprehension, 2) text-guided video temporal grounding, and 3) generalized entity grounding.
First, we explore the problem of referring expression comprehension, which aims to localize an object within a scene based on a natural language description. Synonymous sentences that describe the same object in different ways pose a challenge for learning effective comprehension models. While previous work typically treats each sentence individually, we explicitly account for this synonymy. We develop an end-to-end trainable framework that learns contrastive features at both the image and instance levels, so that sentences describing the same object are mapped closer together in the visual feature space. Our method outperforms state-of-the-art approaches and demonstrates transferability to unseen datasets.
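To make the instance-level contrastive objective concrete, the sketch below shows an InfoNCE-style loss in PyTorch in which sentences referring to the same object share a positive region and the remaining regions in the batch act as negatives. The tensor shapes, function name, and temperature value are illustrative assumptions rather than the exact training code of the proposed framework.

```python
import torch
import torch.nn.functional as F

def instance_level_contrastive_loss(region_feats, sent_feats, object_ids, temperature=0.1):
    """InfoNCE-style sketch: pull each sentence embedding toward the region of the
    object it describes and push it away from the other regions in the batch.

    region_feats: (N, D) visual features, one per candidate object instance
    sent_feats:   (M, D) language features, possibly several per object
    object_ids:   (M,) index of the referred region for each sentence
    """
    region_feats = F.normalize(region_feats, dim=-1)
    sent_feats = F.normalize(sent_feats, dim=-1)

    # Cosine similarity between every sentence and every candidate region.
    logits = sent_feats @ region_feats.t() / temperature  # (M, N)

    # Sentences describing the same object share the same positive region,
    # which draws their features together in the visual embedding space.
    return F.cross_entropy(logits, object_ids)

# Example: 4 regions, 6 sentences (some objects described by several sentences).
loss = instance_level_contrastive_loss(
    torch.randn(4, 256), torch.randn(6, 256), torch.tensor([0, 0, 1, 2, 3, 3]))
```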
Second, we tackle the task of text-guided video temporal grounding, which seeks to identify the temporal interval of a specified event based on a natural language description. Unlike most existing methods that rely solely on visual features from RGB frames, we propose a multimodal framework that exploits complementary information in videos: RGB frames for appearance, optical flow for motion, and depth maps for scene structure. To integrate these modalities and enable inter-modal learning, we design a dynamic fusion module built on transformers that models interactions among them. In addition, we incorporate intra-modal self-supervised learning to enhance feature representations across videos. We demonstrate that the proposed multimodal framework with inter- and intra-modal feature learning surpasses previous approaches.
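The sketch below illustrates one way such transformer-based inter-modal fusion could look in PyTorch: clip-level features from the three modalities are tagged with learnable modality embeddings, concatenated into a single token sequence, and processed by a standard transformer encoder. The dimensions, module name, and token-concatenation scheme are simplifying assumptions, not the exact dynamic fusion module described in the dissertation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch of transformer-based fusion over per-clip features
    from three modalities (RGB, optical flow, depth)."""

    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable embeddings indicating which modality each token comes from.
        self.modality_embed = nn.Parameter(torch.zeros(3, 1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, rgb, flow, depth):
        # rgb, flow, depth: (B, T, dim) clip-level features per modality.
        B, T, D = rgb.shape
        tokens = torch.stack([rgb, flow, depth])       # (3, B, T, dim)
        tokens = tokens + self.modality_embed          # tag tokens by modality
        tokens = tokens.permute(1, 0, 2, 3).reshape(B, 3 * T, D)
        fused = self.encoder(tokens)                   # inter-modal self-attention
        # Average the three modality views of each time step back together.
        return fused.reshape(B, 3, T, D).mean(dim=1)   # (B, T, dim)
```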
Third, we investigate the task of generalized entity grounding, whose goal is to densely ground visual entities referred to in a long caption. This task resembles referring expression segmentation but is more challenging: it requires producing a segmentation mask associated with each noun phrase in a long caption, and it aims to ground both "thing" and "stuff" categories. To address these challenges, we employ a large multimodal model to extract semantic nouns and a class-agnostic segmentation model to generate entity masks. We then correlate these outputs using the proposed feature blending and association modules. Experiments on panoptic narrative grounding, referring expression segmentation, and panoptic segmentation demonstrate the effectiveness of the proposed method.
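As a rough illustration of the association step, the sketch below pools dense image features inside each class-agnostic mask and matches the resulting entity embeddings to noun-phrase embeddings by cosine similarity. The function name, mean-pooling scheme, and greedy argmax assignment are illustrative assumptions; they stand in for, rather than reproduce, the proposed feature blending and association modules.

```python
import torch
import torch.nn.functional as F

def associate_nouns_with_masks(noun_feats, image_feats, masks):
    """Simplified sketch of matching noun phrases to class-agnostic entity masks.

    noun_feats:  (P, D) embeddings of noun phrases extracted from the caption
    image_feats: (D, H, W) dense image features
    masks:       (K, H, W) binary masks from a class-agnostic segmenter
    Returns, for each of the P noun phrases, the index of its best-matching mask.
    """
    D, H, W = image_feats.shape
    K = masks.shape[0]
    flat = image_feats.reshape(D, H * W)             # (D, HW)
    m = masks.reshape(K, H * W).float()              # (K, HW)

    # Mean-pool image features inside each mask to get one embedding per entity.
    mask_feats = (m @ flat.t()) / m.sum(dim=1, keepdim=True).clamp(min=1.0)

    # Cosine similarity between every noun phrase and every entity mask.
    sim = F.normalize(noun_feats, dim=-1) @ F.normalize(mask_feats, dim=-1).t()
    return sim.argmax(dim=-1)                        # (P,)
```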