UC San Diego
Adaptation of Visual Models with Cross-modal Regularization
- Author(s): Costa Pereira, Jose Maria C.
- Advisor(s): Vasconcelos, Nuno
- et al.
Semantic representations of images have been widely adopted in Computer Vision. A vocabulary of concepts of interest is first identified and classifiers are learned for the detection of those concepts. Images are then classified and mapped to a space where each feature is a score for the detection of a concept. This representation brings several advantages. First, the generalization from low-level features to the concept level enables similarity measures that correlate much better with user expectations. Second, because semantic features are, by definition, discriminant for tasks like image categorization, the semantic representation enables a solution for such tasks with low-dimensional classifiers. Third, the semantic representation is naturally aligned with recent interest in contextual modeling. This is important for tasks such as object recognition, where detection of contextually related objects has been shown to improve detection of certain objects of interest, or semantic segmentation, where the coherence of segment semantics can be exploited to achieve more robust segmentations. Lastly, due to their abstract nature, semantic spaces enable a unified representation for data from different content modalities, e.g. images, text, or audio. This opens up a new set of possibilities for multimedia processing, enabling operations such as cross-modal retrieval, or image de-noising by text regularization. This unified representation for multi-modal data is the starting point of the proposed framework for the adaptation of visual models with cross-modal regularization.
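The mapping from low-level features to a semantic space can be illustrated with a minimal sketch. It assumes one linear classifier per concept and a softmax to turn classifier outputs into scores; the abstract does not specify the classifiers, so the function name `semantic_representation`, the random weights, and the dimensions below are hypothetical and chosen only for illustration.

```python
import numpy as np

def semantic_representation(features, weights, biases):
    """Map low-level features to concept scores (one score per concept).

    Assumption: linear concept classifiers followed by a softmax, so each
    row of the result is a probability vector over the concept vocabulary.
    """
    logits = features @ weights + biases           # (n_images, n_concepts)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical data: 4 images with 16-dim low-level features, 5 concepts.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))
weights = rng.normal(size=(16, 5))   # stand-in for learned classifier weights
biases = np.zeros(5)

semantics = semantic_representation(features, weights, biases)
```

Each row of `semantics` is a low-dimensional descriptor whose entries can be read as concept scores, which is what makes it possible to compare data from different modalities in the same space.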
We start by pointing out the problems in computing similarity on heterogeneous data, and propose two fundamental hypotheses to deal with those issues: first, learning a space that maximizes the correlation between the (heterogeneous) data; second, learning a representation where the data lies at a higher level of abstraction. Empirical evidence is shown in favor of each hypothesis; furthermore, the hypotheses are shown to be complementary. We then pursue the (semantic) abstraction hypothesis to gain a deeper understanding of the robustness of these representations and to study the richness of this space, as it strongly influences the discriminative power of such descriptors.
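The correlation hypothesis can be sketched with canonical correlation analysis (CCA), one standard way of learning projections that maximize correlation between two views. The abstract does not name the method, so treating it as CCA is an assumption; the function `fit_cca`, the regularization constant `reg`, and the synthetic "image" and "text" features are likewise hypothetical.

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-6):
    """Regularized CCA via whitening and SVD (a sketch, not a tuned solver).

    Returns projection matrices Wx, Wy and the canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)

    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}.
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T

    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])     # Lx^{-T} U
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])  # Ly^{-T} V
    return Wx, Wy, s[:n_components]

# Synthetic two-view data sharing a 2-dim latent variable (illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))  # "image" view
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))  # "text" view

Wx, Wy, corrs = fit_cca(X, Y, n_components=2)

# Project both views into the shared subspace, where they can be compared.
X_proj = (X - X.mean(axis=0)) @ Wx
Y_proj = (Y - Y.mean(axis=0)) @ Wy
```

After projection, items from either view live in the same low-dimensional subspace, so a nearest-neighbor search between `X_proj` and `Y_proj` is one simple way to perform cross-modal matching under this hypothesis.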
It has been shown that categories unknown to the semantic space, when represented in it, exhibit a pattern of co-occurring concepts that describes them accurately and sensibly; e.g. the concept of fishing might not belong to the semantic space and instead be represented by the set water, boat, people and gear. Even though the amount of labeled data continues to increase with ongoing efforts from different research communities, building a semantic space that is universal remains a challenging task. We show evidence of the robustness of representations in the semantic space.
Noting that images are frequently published on the web together with loosely related text, we use the semantic representations described above to introduce the theoretical principles of a feature regularizer for image semantic representations based on auxiliary data. This proves very effective at improving retrieval precision and recall in the task of content-based image retrieval (CBIR). Its results are compared to those of recently developed methods, achieving significant gains on three benchmark datasets and raising the bar of state-of-the-art performance for image retrieval.