Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Learning to see and hear without human supervision


Imagine the sound of waves. This sound may evoke memories of days at the beach. A single sound thus serves as a bridge connecting multiple instances of a visual scene: it can group scenes that 'go together' and set apart those that do not. Co-occurring sensory signals can therefore be used as a target for learning powerful visual representations without relying on costly human annotations.

In this thesis, I introduce effective self-supervised learning methods that curb the need for human supervision. I discuss several tasks that benefit from audio-visual learning, including representation learning for action and audio recognition, visually-driven sound source localization, and spatial sound generation. I introduce an effective contrastive learning framework that learns audio-visual models by answering multiple-choice audio-visual association questions. I also discuss two critical challenges that arise when learning from audio supervision: noisy audio-visual associations and the lack of spatial grounding of sound signals in common videos.
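
As a rough illustration of what "answering multiple-choice audio-visual association questions" can look like in practice, the sketch below implements a generic symmetric cross-modal contrastive (InfoNCE-style) loss in NumPy. It is not taken from the thesis: the function name cross_modal_nce_loss, the temperature value, and the use of in-batch negatives are all assumptions made for illustration. Each video embedding must pick its own audio track out of every audio track in the batch, and vice versa.

```python
import numpy as np


def cross_modal_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Hypothetical helper, not the thesis implementation.
    video_emb, audio_emb: arrays of shape (N, D); row i of each array
    is assumed to come from the same video clip.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # logits[i, j]: similarity between video i and audio j.
    logits = (v @ a.T) / temperature

    def diagonal_cross_entropy(scores):
        # Row i is one multiple-choice question whose correct answer
        # is column i; the other columns act as distractors.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

With correctly paired embeddings the loss should be lower than with pairs shuffled across the batch, which is the signal a model would be trained on. Noisy associations, such as a clip whose soundtrack is unrelated to its visuals, violate the assumption that row i of each array truly belongs together.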
