The goal of extracting rich, reusable representations that capture the factors relevant to downstream tasks remains challenging, even though the field of deep learning has made tremendous progress in this direction. This thesis presents a few promising contributions toward that goal, along two axes: (1) self-supervised (or unsupervised) representation learning; (2) deep neural network architectures powered by self-attention. Progress in architectures and the ability to leverage massive amounts of unlabeled data have been responsible for major advances in NLP such as GPT-x and BERT. This thesis presents small steps toward realizing such progress for perceptual and reinforcement learning tasks. It is a thesis by articles comprising four articles: two focused on computer vision benchmarks and two on reinforcement learning.
With respect to the first axis, the thesis presents three articles: (1) Data-Efficient Image Recognition using Contrastive Predictive Coding (CPCv2); (2) Contrastive Unsupervised Representations for Reinforcement Learning (CURL); (3) Reinforcement Learning with Augmented Data (RAD). The first two articles explore a form of unsupervised learning called contrastive learning, a technique better suited to raw perceptual inputs such as images than the generative pre-training popular in language. The first article presents results for label-efficient image recognition. The second article presents the benefits of contrastive learning for sample-efficient reinforcement learning from pixels. In practice, contrastive learning depends heavily on data augmentations, and the third article presents a detailed investigation and discussion of their role.
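To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss of the kind underlying CPC and CURL, written in PyTorch. The function and tensor names (`info_nce_loss`, `queries`, `keys`, `temperature`) are illustrative rather than taken from the articles, and details such as the encoder and the similarity function vary across the papers.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss.

    queries: (N, D) embeddings of one augmented view of each input.
    keys:    (N, D) embeddings of a second augmented view; keys[i] is
             the positive for queries[i], the other keys are negatives.
    """
    queries = F.normalize(queries, dim=1)
    keys = F.normalize(keys, dim=1)
    # Pairwise similarities; the diagonal holds the positive pairs.
    logits = queries @ keys.t() / temperature          # (N, N)
    labels = torch.arange(queries.size(0), device=queries.device)
    # Cross-entropy pulls each query toward its positive key and
    # pushes it away from the N - 1 negatives in the batch.
    return F.cross_entropy(logits, labels)
```

CURL in particular scores pairs with a learned bilinear similarity rather than the plain dot product used here, but the structure of the objective (classify the positive among in-batch negatives) is the same, and the augmented views feeding it are exactly where data augmentation enters the picture.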
As for the second axis, the thesis presents a thorough empirical investigation of the benefits of self-attention and Transformer-like architectures for computer vision through the article Bottleneck Transformers for Visual Recognition. Self-attention has revolutionized language processing, but computer vision poses a challenge for vanilla Transformers: high-resolution inputs are expensive under the quadratic memory and computational complexity of the attention primitive. The article demonstrates the empirical effectiveness of a straightforward hybrid of convolutions and self-attention, unifying ResNet-based and Transformer-based architecture design for computer vision.
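In the article, the hybrid amounts to replacing the 3x3 convolutions in the final ResNet stage (where feature maps are smallest, so quadratic attention cost is manageable) with multi-head self-attention. The sketch below is a simplified PyTorch illustration of such a block, assuming a standard bottleneck structure; it omits the relative position encodings used in the article, and the class name `BottleneckAttentionBlock` is hypothetical.

```python
import torch
import torch.nn as nn

class BottleneckAttentionBlock(nn.Module):
    """Simplified sketch of a ResNet bottleneck block whose 3x3
    convolution is replaced by multi-head self-attention.
    Hypothetical names; relative position encodings omitted."""

    def __init__(self, channels, bottleneck_channels, num_heads=4):
        super().__init__()
        # num_heads must divide bottleneck_channels.
        self.reduce = nn.Conv2d(channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        # Self-attention stands in for the usual 3x3 convolution.
        self.attn = nn.MultiheadAttention(bottleneck_channels, num_heads,
                                          batch_first=True)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.expand = nn.Conv2d(bottleneck_channels, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.reduce(x)))
        # Flatten the H x W grid into a token sequence so every
        # spatial position can attend to every other position.
        n, c, h, w = out.shape
        tokens = out.flatten(2).transpose(1, 2)        # (N, H*W, C)
        tokens, _ = self.attn(tokens, tokens, tokens)
        out = tokens.transpose(1, 2).reshape(n, c, h, w)
        out = self.relu(self.bn2(out))
        out = self.bn3(self.expand(out))
        return self.relu(out + identity)

# Example: a block operating on a 14x14 feature map with 256 channels.
block = BottleneckAttentionBlock(channels=256, bottleneck_channels=64)
y = block(torch.randn(2, 256, 14, 14))
```

Because attention over H*W tokens scales quadratically with the spatial resolution, convolutions handle the early, high-resolution stages and self-attention is introduced only once the feature map has been downsampled, which is the central design point of the article.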