eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Representation Learning for Music and Audio Intelligence

Abstract

With recent breakthroughs in machine learning, the pursuit of efficient and effective feature representations has taken center stage, opening new possibilities for a variety of downstream applications. While significant progress has been made in natural language processing and computer vision, there remains a pressing need for a robust audio representation model that can power advanced audio applications.

In this dissertation, we start from the design of an audio transformer, HTS-AT, which serves as the cornerstone of the work and is built to capture both the semantic and the acoustic information in audio data. We then show, step by step, how HTS-AT enables a wide range of downstream applications in audio understanding and generative audio AI. Specifically, we first adapt HTS-AT to audio event classification, assessing how well it captures the semantics of audio tracks. Next, we use the audio embeddings of HTS-AT for audio source separation, evaluating how well they capture the acoustic features of audio. To support applications that involve other modalities, we propose a contrastive language-audio pretraining model (CLAP) that combines HTS-AT with a language understanding model to capture the information shared between audio and text representations. Building on these explorations, we turn to content creation and propose MusicLDM, a latent diffusion model that uses CLAP embeddings to perform text-to-music generation.
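To make the idea of contrastive language-audio pretraining concrete, the following is a minimal sketch of a symmetric contrastive objective over paired audio and text embeddings. It is an illustration under stated assumptions, not the dissertation's implementation: the abstract does not specify the loss, so the symmetric cross-entropy form, the `temperature` value of 0.07, and the function names `l2_normalize`, `cross_entropy_diag`, and `symmetric_contrastive_loss` are all hypothetical choices made here. In practice the embeddings would come from an audio encoder such as HTS-AT and a text encoder; here they are plain lists of floats.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Assumed symmetric contrastive loss over a batch of paired embeddings.

    audio_embs[i] and text_embs[i] are treated as a matching pair; every
    other combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every audio and every text, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: three orthogonal "audio" embeddings.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [row[:] for row in audio]              # each text matches its audio
shifted_text = [audio[1], audio[2], audio[0]]         # each text matches the wrong audio

loss_aligned = symmetric_contrastive_loss(audio, aligned_text)
loss_shifted = symmetric_contrastive_loss(audio, shifted_text)
print(loss_aligned < loss_shifted)
```

The loss is low when each audio embedding is closest to its own text embedding and high when the pairing is wrong, which is the property that lets a shared audio-text embedding space support downstream uses such as text-conditioned generation.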

Across these designs, experiments, and application studies, we show that a simple audio transformer can be adapted successfully to a range of audio downstream tasks and achieves superior performance on them. Further applications in audio content extraction and creation remain open, and we discuss our ongoing and planned work on addressing their challenges and realizing their full potential.
