Large-scale generative models have fueled recent progress in artificial intelligence. Armed with scaling laws that accurately predict model performance as invested compute increases, NLP has become the gold standard for all disciplines of AI. Given a new task, pre-trained generative models can either solve it zero-shot or be efficiently fine-tuned on a small number of task-specific training examples. However, the widespread adoption of generative models has lagged in other domains---such as vision and meta-learning. In this thesis, we study ways to train improved, scalable generative models of two modalities---images and neural network parameters. We also examine how pre-trained generative models can be leveraged to tackle additional downstream tasks.
We begin by introducing a new, powerful class of generative models---Diffusion Transformers (DiTs). We show that transformers---with one small yet critically important modification---retain their excellent scaling properties for diffusion-based image generation and outperform convolutional neural networks that have previously dominated the area. DiT outperforms all prior generative models on the class-conditional ImageNet generation benchmark.
Next, we introduce a novel framework for learning to learn based on building generative models of a new data source---neural network checkpoints. We create datasets containing hundreds of thousands of deep learning training runs and use them to train generative models of neural network checkpoints. Given a starting parameter vector and a target loss, error, or reward, loss-conditional diffusion models trained on this data can sample parameter updates that achieve the desired metric. We apply our framework to problems in vision and reinforcement learning.
Finally, we explore how pre-trained image-level generative models can be used to tackle downstream tasks in vision without requiring task-specific training data. We show that pre-trained GAN generators can be used to create an infinite data stream to train networks for the dense visual correspondence problem---without requiring any human-annotated supervision like keypoints. Networks trained on this completely GAN-generated data generalize zero-shot to real images, and they outperform previous self-supervised and keypoint-supervised approaches that train on real data.