Learning correspondence is a fundamental problem in computer vision with far-reaching applications. Correspondence measures the similarity between entities such as images, videos, and texts. As deep neural networks have achieved remarkable success in computer vision over the past few years, inferring correspondence has been cast as a representation learning problem: we learn useful feature representations and infer correspondence from them. In this thesis, we study how to learn several types of correspondence and explore their applications.
First, we consider dense low-level correspondence between successive video frames, where optical flow represents temporal pixel-level correspondence. In deep optical flow estimation, the cost volume is a central component that encodes pixel-level correlation between the two frames. We propose a learnable cost volume (LCV) layer that computes correlations through a positive definite kernel matrix, learned via its Cayley representation. The proposed LCV is a lightweight module that can be plugged into existing models to replace the conventional cost volume. It reduces flow estimation errors and improves robustness against illumination variations, noise, and adversarial input perturbations.
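To make the idea concrete, the following is a minimal PyTorch sketch of such a layer: the kernel is parameterized as W = Uᵀ diag(exp(s)) U, where U is an orthogonal matrix obtained from the Cayley transform of a skew-symmetric matrix, so W is positive definite by construction. The module name, the exponential-diagonal parameterization, and the local search window are illustrative assumptions rather than the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCostVolume(nn.Module):
    """Sketch of a learnable cost volume with a positive definite kernel.

    W = U^T diag(exp(s)) U, with U given by the Cayley transform of a
    skew-symmetric matrix. Names and details are illustrative assumptions."""

    def __init__(self, channels, max_disp=4):
        super().__init__()
        self.max_disp = max_disp
        # Unconstrained parameters: a square matrix (made skew-symmetric below)
        # and log-eigenvalues that guarantee a strictly positive spectrum.
        self.raw = nn.Parameter(torch.zeros(channels, channels))
        self.log_eig = nn.Parameter(torch.zeros(channels))

    def kernel(self):
        a = self.raw - self.raw.t()                        # skew-symmetric A
        eye = torch.eye(a.size(0), device=a.device)
        u = torch.linalg.solve(eye + a, eye - a)           # Cayley: U = (I+A)^{-1}(I-A)
        return u.t() @ torch.diag(self.log_eig.exp()) @ u  # positive definite W

    def forward(self, f1, f2):
        # f1, f2: (B, C, H, W) features of two consecutive frames.
        b, c, h, w = f1.shape
        f1w = torch.einsum("bchw,cd->bdhw", f1, self.kernel())  # kernel applied to frame-1 features
        d = self.max_disp
        f2pad = F.pad(f2, (d, d, d, d))
        vols = []
        for dy in range(2 * d + 1):
            for dx in range(2 * d + 1):
                shifted = f2pad[:, :, dy:dy + h, dx:dx + w]
                vols.append((f1w * shifted).sum(dim=1, keepdim=True))
        return torch.cat(vols, dim=1) / c                  # (B, (2d+1)^2, H, W)
```

With the kernel fixed to the identity, this reduces to the conventional inner-product cost volume, which is why the layer can serve as a drop-in replacement.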
Second, we study semantic correspondence across different images, a task more challenging than optical flow estimation because objects of the same category can vary widely in appearance, scale, and pose. We represent the semantic similarity between two images with an affinity matrix and propose a multi-level contrastive learning approach for semantic matching. Image-level contrastive learning guides the convolutional features to locate correspondence between similar objects, and a pixel-level cross-instance cycle consistency objective further improves matching accuracy. The resulting method outperforms prior semantic matching approaches.
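As an illustration, the following is a minimal PyTorch sketch of the pixel-level component under simplifying assumptions: an affinity matrix is computed from normalized convolutional features of two images, and a cycle consistency loss encourages every pixel to map back to itself after a round trip between the two images. The temperature value and function names are illustrative, and the image-level contrastive term is omitted.

```python
import torch
import torch.nn.functional as F

def affinity(feat_a, feat_b, tau=0.05):
    """Row-normalized affinity between two feature maps.

    feat_a, feat_b: (B, C, H, W). Returns (B, H*W, H*W), where each row is a
    distribution over target pixels. The temperature tau is an illustrative choice."""
    fa = F.normalize(feat_a.flatten(2), dim=1)        # (B, C, H*W), unit-norm per pixel
    fb = F.normalize(feat_b.flatten(2), dim=1)
    sim = torch.einsum("bci,bcj->bij", fa, fb) / tau  # cosine similarities
    return sim.softmax(dim=-1)

def cycle_consistency_loss(feat_a, feat_b):
    """Pixel-level cycle consistency: the mapping A -> B -> A should be the identity."""
    a2b = affinity(feat_a, feat_b)                    # (B, N, N) with N = H*W
    b2a = affinity(feat_b, feat_a)
    cycle = torch.bmm(a2b, b2a)                       # round-trip transition probabilities
    n = cycle.size(1)
    target = torch.arange(n, device=cycle.device).repeat(cycle.size(0))
    # Each pixel should return to itself after the round trip.
    return F.nll_loss(cycle.clamp_min(1e-8).log().flatten(0, 1), target)
```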
Finally, we explore the correspondence between images and text, which is central to vision-language foundation models that bridge the two modalities. These models employ a visual encoder and a textual encoder to map both modalities into a shared embedding space. While representations pretrained on large-scale data yield impressive zero-shot performance on tasks such as image classification, they are harder to adapt when only a few labeled examples per category are available. To address this challenge, we propose a category name initialization method that initializes the visual classification head with the text embeddings of category names. Extensive experiments show that category name initialization enables our model to achieve state-of-the-art results on various few-shot image classification benchmarks.
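The following is a minimal sketch of category name initialization, assuming a CLIP-style model (using the OpenAI CLIP package for concreteness, which the thesis does not necessarily prescribe): the weights of a linear classification head are set to the normalized text embeddings of the category names before few-shot fine-tuning. The prompt template and the bias-free linear head are illustrative choices.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP; any CLIP-style vision-language model would do

def build_classifier(class_names, device="cpu"):
    """Initialize a visual classification head with text embeddings of category names."""
    model, _ = clip.load("ViT-B/32", device=device)
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(prompts)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    head = nn.Linear(text_emb.shape[-1], len(class_names), bias=False).to(device)
    head.weight.data.copy_(text_emb)   # class weights start at the category name embeddings
    return model, head

# During few-shot fine-tuning, logits come from normalized image features:
#   feats = model.encode_image(images)
#   feats = feats / feats.norm(dim=-1, keepdim=True)
#   logits = head(feats)
```

Before any fine-tuning, this head reproduces zero-shot classification; the few available labeled examples then refine it rather than train it from scratch.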