
UC Riverside Electronic Theses and Dissertations

Robustness of Vision-Language Systems to Intentional and Incidental Adversity

No data is associated with this publication.
Abstract

The availability of large data collections enables deep neural networks to perform impressively on a wide range of vision-language tasks. However, the existence of corrupted samples is inevitable when working with such massive, hard-to-manage amounts of data. Data corruption refers to errors that arise either from an intentional process that alters data samples in order to hide information, or from an incidental data preparation step, such as data annotation, that introduces unintended changes to the original data. Robustness of vision-language systems to intentional and incidental adversity is the goal of this dissertation.

Face manipulation is one of the most challenging intentional adversity techniques that distort the truth. Its primary threat stems from people becoming convinced that something fictional really occurred. The dissertation starts with a study of facial manipulation detectors that leverage facial expression recognition systems. Concerns about the widespread circulation of corrupted images and videos on social media necessitate precise detection of such fraud. Multi-task learning can leverage the prominent features learned by a facial expression recognition system to benefit the training of conventional manipulation detection systems. Such an approach achieves impressive performance in facial expression manipulation detection while exhibiting robustness to other common facial corruption mechanisms, such as identity manipulation.
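As a rough illustration of the multi-task idea, the sketch below pairs a shared image encoder with an expression-recognition head and a manipulation-detection head trained under a joint loss. The backbone choice, head shapes, and loss weight are assumptions made for illustration, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskFaceNet(nn.Module):
    """Shared backbone with two heads: expression recognition and manipulation detection."""
    def __init__(self, num_expressions=7):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc layer
        feat_dim = backbone.fc.in_features
        self.expr_head = nn.Linear(feat_dim, num_expressions)  # expression classes
        self.manip_head = nn.Linear(feat_dim, 2)                # real vs. manipulated

    def forward(self, x):
        feats = self.encoder(x).flatten(1)
        return self.expr_head(feats), self.manip_head(feats)

def multitask_loss(expr_logits, manip_logits, expr_labels, manip_labels, alpha=0.5):
    # Joint objective: manipulation detection benefits from expression features
    # learned through the auxiliary expression-recognition task.
    ce = nn.functional.cross_entropy
    return ce(manip_logits, manip_labels) + alpha * ce(expr_logits, expr_labels)
```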

Creating the large-scale datasets needed for tasks like image-text retrieval is challenging. As a solution, recent works exploit vision-language pre-training to overcome errors in the training data. Vision-language pre-training has advanced performance on many vision-language tasks, including image-text retrieval. However, the improvement has largely been achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. Web-crawled captions often do not accurately describe the visual content of the images, making them noisy signals. Instead, the caption descriptions can be generated by off-the-shelf automatic image captioners.
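For context, the snippet below shows one way such captions could be produced with an off-the-shelf captioner. It uses the BLIP captioning model from Hugging Face Transformers purely as an example; the dissertation does not commit to this particular captioner.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf image captioner (illustrative choice, not prescribed by the dissertation).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Generate a synthetic caption for one image, replacing a noisy web-crawled caption."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```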

In this dissertation, we focus on learning image-text retrieval models from image-text pairs whose text is generated by an automatic captioning process, without requiring tedious annotation effort. We propose a novel, robust image-text retrieval model that uses data-driven curriculum learning to down-weight the noisy portion of the captioner-generated dataset. Finally, we investigate fine-grained image-text retrieval, which decomposes image-text matching into global-to-local matching using pre-trained object detectors. We utilize an alignment model that bridges the gap between corresponding embeddings in different modalities to identify noise in the captions. It aims to learn image-text multimodal representations that capture the appropriate fine-grained alignment between vision and language.
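A minimal sketch of the down-weighting idea is given below, assuming a symmetric contrastive retrieval loss in which per-pair weights are derived from the current per-pair loss so that poorly aligned (likely noisy) captions contribute less. The specific weighting rule, the `sharpness` parameter, and the function names are illustrative assumptions rather than the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(img_emb, txt_emb, temperature=0.07, sharpness=5.0):
    """Symmetric InfoNCE over an image-text batch, with per-pair weights that
    down-weight pairs the model currently finds hard to align (likely noisy captions)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Per-pair loss in both retrieval directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair = 0.5 * (loss_i2t + loss_t2i)

    # Curriculum-style weights: high-loss (poorly aligned) pairs get smaller weight,
    # normalized so the average weight over the batch stays close to one.
    with torch.no_grad():
        weights = torch.softmax(-sharpness * per_pair, dim=0) * per_pair.numel()

    return (weights * per_pair).mean()
```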


This item is under embargo until January 26, 2025.