Our experience of the world is multimodal: we see objects, hear sounds, and read text to perceive information. For Artificial Intelligence to make progress in understanding the world around us, it must be able to interpret such multimodal signals jointly. The heterogeneity of the data poses unique challenges when working with multimodal signals. One such challenge is identifying and understanding the alignment between two different modalities. In this dissertation, we focus on learning to align the vision and language modalities in both static and dynamic tasks across different scenarios.
First, we address the task of text-based video moment localization. Existing approaches assume that the relevant video is already given and attempt to localize the moment described by the text query in that video only. We relax this strong assumption and address the task of localizing moments in a corpus of videos given a text query. This task poses a unique challenge because the system must simultaneously retrieve the relevant video and temporally localize the moment within the retrieved video, both based on the text query. Our proposed approach learns to distinguish subtle differences between intra-video moments as well as inter-video global semantic concepts based on text queries.
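As a rough illustration of the corpus-level setting, the sketch below scores every candidate moment in every video against the query and returns the best (video, moment) pair, so retrieval and localization happen in one pass. It assumes that query and moment embeddings already live in a shared space and that cosine similarity is the relevance measure; the function names, the corpus layout, and the similarity choice are illustrative assumptions, not the model proposed in this dissertation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def localize_in_corpus(query_vec, corpus):
    """Return (score, video_id, (start, end)) of the best-matching moment.

    corpus: dict mapping video_id -> list of ((start, end), moment_vec).
    This layout is a hypothetical one chosen for the sketch.
    """
    best = None
    for video_id, moments in corpus.items():
        for span, moment_vec in moments:
            score = cosine(query_vec, moment_vec)
            if best is None or score > best[0]:
                best = (score, video_id, span)
    return best

# Toy corpus: two videos, two candidate moments each (made-up embeddings).
corpus = {
    "v1": [((0, 5), [1.0, 0.0]), ((5, 10), [0.0, 1.0])],
    "v2": [((0, 4), [0.9, 0.1]), ((4, 8), [0.6, 0.8])],
}
score, video_id, span = localize_in_corpus([0.6, 0.8], corpus)
print(video_id, span)
```

Exhaustive scoring is used here only for clarity; a practical system would need a more efficient search over the corpus.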
We also consider the text-based temporal localization task in a setting where neither the video moments nor the text queries are observed during training. Conventional approaches are trained and evaluated under the assumption that, during testing, the localization system will only encounter events that are present in the training set. As a result, these models are unlikely to generalize to the practical requirement of localizing a wider range of events, some of which may be unseen. To address this problem, we formulate the inference task of text-based moment localization as a relational prediction problem, hypothesizing a conceptual relation between semantically relevant moments. The likelihood that a candidate moment is the correct one for an unseen text query depends on its relevance to the moment corresponding to the semantically most relevant seen query.
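The relational idea can be sketched in a few lines: find the seen query closest to the unseen one, take the moment paired with it as an anchor, and score each candidate moment by its similarity to that anchor. The embeddings, the use of cosine similarity, and the names `score_candidates` and `seen_pairs` are assumptions made for illustration only; the actual relational model is not specified here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_candidates(unseen_query, seen_pairs, candidates):
    """Score candidate moments for an unseen query via the nearest seen query.

    seen_pairs: list of (query_vec, moment_vec) from training (hypothetical layout).
    candidates: list of moment_vec for the test video.
    """
    # Step 1: pick the seen query that is semantically closest to the unseen one.
    _, anchor_moment = max(seen_pairs, key=lambda p: cosine(unseen_query, p[0]))
    # Step 2: a candidate's score is its relevance to that query's moment.
    return [cosine(anchor_moment, c) for c in candidates]

# Made-up embeddings: queries are 2-D, moments are 3-D.
seen_pairs = [([1.0, 0.0], [1.0, 0.0, 0.0]), ([0.0, 1.0], [0.0, 0.0, 1.0])]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
scores = score_candidates([0.9, 0.1], seen_pairs, candidates)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index)
```

Note that queries and moments never need to be compared directly at test time in this sketch: the seen query acts as a bridge between the two modalities.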
Continuing in the direction of learning to align multimodal data, we extend our study to the dynamic task of Audio-Visual-Language embodied navigation in 3D environments. The goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), who provides assistance in free-form natural language. We propose a multimodal hierarchical reinforcement learning backbone that learns (a) high-level policies that choose between using audio cues for navigation and querying the oracle, and (b) lower-level policies that select navigation actions based on audio-visual or audio-visual-language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle.
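The two-level decision structure and the training signal can be sketched as follows. The high-level policy picks an option ("audio" or "query"), the matching low-level policy picks a navigation action, and the episode reward trades off task success against the number of oracle queries. The stub policies, the `uncertainty` field, the action names, and the reward constants are all hypothetical placeholders, not the learned policies or the reward used in this dissertation.

```python
from typing import Callable, Dict, Tuple

Observation = Dict[str, float]

def hierarchical_step(
    obs: Observation,
    high_level: Callable[[Observation], str],
    low_level: Dict[str, Callable[[Observation], str]],
) -> Tuple[str, str]:
    """One decision step: choose an option, then an action under that option."""
    option = high_level(obs)          # "audio" or "query"
    action = low_level[option](obs)   # navigation action from the chosen policy
    return option, action

def episode_reward(success: bool, num_queries: int,
                   success_bonus: float = 1.0, query_cost: float = 0.1) -> float:
    """Reward success and charge a cost per oracle query (assumed constants)."""
    return (success_bonus if success else 0.0) - query_cost * num_queries

# Stub policies standing in for learned networks.
high_level = lambda obs: "query" if obs["uncertainty"] > 0.5 else "audio"
low_level = {
    "audio": lambda obs: "move_forward",  # would use audio-visual inputs
    "query": lambda obs: "turn_left",     # would use audio-visual-language inputs
}

print(hierarchical_step({"uncertainty": 0.8}, high_level, low_level))
print(hierarchical_step({"uncertainty": 0.2}, high_level, low_level))
```

Under this reward, an agent that succeeds with fewer queries always scores higher than one that succeeds with more, which is the trade-off the training objective describes.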