The success of deep neural networks over the past decade was founded on supervised learning, but as model and dataset sizes have grown, so has the desire to break away from the costly human annotation that supervised learning requires. Beyond its cost, the annotation process can be ambiguous, prone to bias, and raise privacy concerns (e.g., in medical imaging). Here, self-supervised learning offers a path forward: models can be trained on cheap and abundantly available unlabeled data.
Improving Self-Supervised Learning. The idea behind self-supervised learning (SSL) is to keep the training pipeline of supervised learning but, in the absence of labels, replace the supervised task with a "pretext" task derived from the data itself. A well-designed pretext task induces the model to learn a feature representation of the data that is useful for downstream tasks. Since the performance of an SSL model relies heavily on its pretext task, one of my research objectives is to improve pretext task design. Specifically, my contributions are as follows. 1) Traditional SSL pretext tasks are less effective for smaller-capacity models than for larger ones; hence, I developed a pretext task better suited to smaller models. 2) Contrastive learning, a popular SSL pretext task, treats semantically similar images as dissimilar (the "false negative" problem); hence, I developed a pretext task that fixes this problem. 3) Clustering-based SSL pretext tasks also suffer from incorrect negatives, in addition to imposing unnecessary priors on the shape and size of clusters; hence, I developed a mean-shift clustering pretext task that addresses both problems (a sketch of the idea follows below). 4) While an improvement over previous clustering methods, the mean-shift pretext task does not cluster semantically diverse samples; hence, I developed a constrained mean-shift clustering pretext task that groups semantically relevant yet distant samples.
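To make the mean-shift idea concrete, here is a minimal sketch of one plausible form of such a pretext loss, assuming a momentum target encoder and a memory bank of past target embeddings (the function name, `memory_bank`, and `k` are illustrative assumptions, not the exact published implementation). Each query embedding is pulled toward the nearest neighbors of its target embedding, i.e., toward the local mean of its neighborhood, without fixing the number, size, or shape of clusters in advance.

```python
import torch
import torch.nn.functional as F

def mean_shift_loss(query, target, memory_bank, k=5):
    """Pull each query embedding toward the k nearest neighbors of its
    target embedding in a memory bank.

    query:       (B, D) embeddings from the online encoder (grad flows here)
    target:      (B, D) embeddings from a momentum/target encoder (no grad)
    memory_bank: (N, D) queue of past target embeddings, L2-normalized
    """
    query = F.normalize(query, dim=1)
    target = F.normalize(target, dim=1)

    # Cosine similarity of each target to the memory bank entries.
    sim = target @ memory_bank.t()              # (B, N)
    _, nn_idx = sim.topk(k, dim=1)              # indices of k-NN per sample
    neighbors = memory_bank[nn_idx]             # (B, k, D)

    # Squared Euclidean distance between unit vectors: 2 - 2 * cos_sim.
    # Minimizing it "shifts" the query toward the local mean of its
    # target's neighborhood.
    dist = 2 - 2 * torch.einsum('bd,bkd->bk', query, neighbors)
    return dist.mean()
```

Because the neighborhood is defined purely by nearest neighbors, the grouping can take on arbitrary shapes and sizes, avoiding the fixed, spherical cluster priors of k-means-style objectives.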
Understanding Self-Supervised Learning. When scaled to large datasets, SSL models have been shown to learn rich, generalizable features in both Natural Language Processing and Computer Vision. The idea is so powerful that, for most applications today, the default first step is to load a self-supervised model and then either use it in a few-shot setting or fine-tune it for the application at hand. Hence, in addition to improving SSL models, the other objective of my research is to understand their inner workings. I have made the following contributions toward this objective. 1) SSL models are vulnerable to a class of adversarial attacks called "backdoor attacks": an attacker who can hijack the data collection pipeline can alter the model's behavior so that it fails in the presence of an attacker-chosen trigger. I analyzed the mechanism through which backdoors affect SSL models and used the insights to develop a defense against the attack. 2) Backdoor attacks are possible because SSL models learn shortcut features present in the dataset. Given that the scope of SSL models extends beyond computer vision, I was interested in understanding the types of shortcuts exploited by the language component of vision-language contrastive models. I showed that such models ignore the grammatical structure of language and simply use it as a bag of words (BoW); a simple probe illustrating this finding is sketched below.
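As an illustration of the bag-of-words behavior, the following is a minimal sketch of a word-order sensitivity probe, assuming a CLIP-style model that exposes `encode_image` and `encode_text` methods (a hypothetical interface, not any specific library's API). If the model treats language as a bag of words, shuffling the words in a caption should barely change its image-text similarity score.

```python
import random
import torch
import torch.nn.functional as F

def order_sensitivity(model, image, caption):
    """Compare image-text similarity for the original caption and a
    word-shuffled version; near-identical scores suggest the model
    uses the caption as a bag of words."""
    words = caption.split()
    shuffled = " ".join(random.sample(words, len(words)))

    with torch.no_grad():
        img = F.normalize(model.encode_image(image), dim=-1)                # (1, D)
        txt = F.normalize(model.encode_text([caption, shuffled]), dim=-1)   # (2, D)

    scores = img @ txt.t()  # (1, 2): [original, shuffled]
    return scores[0, 0].item(), scores[0, 1].item()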