It was a dream to make computers intelligent. Like humans who are capable of understanding information of multiple modalities such as video, text, audio, etc., teaching computers to jointly understand multi-modal information is a necessary and essential step towards artificial intelligence. And how to jointly represent multi-modal information is critical to such step. Although a lot of efforts have been devoted to exploring the representation of each modality individually, it is an open and challenging problem to learn joint multi-modal representation.
In this dissertation, we explore joint image-text representation models based on Visual-Semantic Embedding (VSE). VSE has been recently proposed and shown to be effective for joint representation. The key idea is that by learning a mapping from images into a semantic space, the algorithm is able to learn a compact and effective joint representation. However, existing approaches simply map each text concept and each whole image to single points in the semantic space. We propose several novel visual-semantic embedding models that use (1) text concept modeling, (2) image-level modeling, and (3) object-level modeling. In particular, we first introduce a novel Gaussian Visual-Semantic Embedding (GVSE) model that leverages the visual information to model text concepts as density distributions rather than single points in semantic space. Then, we propose Multiple Instance Visual-Semantic Embedding (MIVSE) via image-level modeling, which discovers and maps the semantically meaningful image sub-regions to their corresponding text labels. Next, we present a fine-grained object-level representation in images, Scene-Domain Active Part Models (SDAPM), that reconstructs and characterizes 3D geometric statistics between object’s parts in 3D scene-domain. Finally, we explore advanced joint representations for other visual and textual modalities, including joint image-sentence representation and joint video-sentence representation.
Extensive experiments have demonstrated that the proposed joint representation models are superior to existing methods on various tasks involving image, video and text modalities, including image annotation, zero-shot learning, object and parts detection, pose and viewpoint estimation, image classification, text-based image retrieval, image captioning, video annotation, and text-based video retrieval.