Multi-Track Music Generation with Latent Diffusion Models

Abstract

In recent years, diffusion models have demonstrated promising results in cross-modal generation tasks across generative media, encompassing image, video, and audio generation. This development has brought considerable novelty to audio- and music-related tasks such as text-to-sound and text-to-music generation. However, these text-controlled music generation models typically focus on capturing global musical attributes, such as genre and mood, and do not allow for the finer-grained control that composers might desire. Music composition is a complex, multilayered task in which intricate musical arrangement is frequently an essential part of the creative process. It requires composers to carefully align each instrument with the existing tracks in terms of beat, dynamics, harmony, and melody, demanding a level of precision and control over individual tracks that current text-driven prompts often fail to provide.

In this work, we address these challenges by presenting a multi-track music generation model, one of the first of its kind. By learning the joint probability of tracks sharing a context, our model can generate music across several tracks that correspond well to each other, either conditionally or unconditionally. We achieve this by extending MusicLDM, a latent diffusion model for music, into a multi-track generative model. Additionally, our model is capable of arrangement generation, in which it generates any subset of tracks given the others (e.g., generating a piano track that complements given bass and drum tracks). We compare our model with existing multi-track generative models and show that it achieves considerable improvements across objective metrics on both total and arrangement generation tasks. We further demonstrate that the model supports meaningful conditional generation with text and reference musical audio, corresponding well to the text's meaning and the reference audio's content and style. Sound examples from this work can be found at https://mtmusicldm.github.io.
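To make the arrangement-generation setting concrete, the following is a minimal, hypothetical sketch of inpainting-style sampling over per-track latents: tracks that are given are held to their own forward-diffusion trajectory at each step, while the missing tracks are denoised jointly. All names (denoiser, arrangement_sample, the noise schedule, and the latent shapes) are illustrative assumptions and are not taken from the dissertation; the actual model extends MusicLDM, whose details are not reproduced here.

```python
import torch

# Hypothetical setup: each track has its own latent vector, stacked on a track axis.
# All shapes and schedules below are illustrative, not from the thesis.
n_steps, n_tracks, latent_dim = 50, 4, 128

# Toy linear noise schedule (DDPM-style).
betas = torch.linspace(1e-4, 0.02, n_steps)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoiser(x_t, t):
    """Placeholder for a trained multi-track denoising network.

    A real model would predict the noise jointly across all tracks so that
    generated tracks stay coherent with each other.
    """
    return torch.zeros_like(x_t)  # stand-in: predicts zero noise

def q_sample(x0, t, noise):
    """Forward-diffuse clean latents x0 to step t."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

@torch.no_grad()
def arrangement_sample(known_latents, known_mask):
    """Generate the missing tracks conditioned on the given ones.

    known_latents: (n_tracks, latent_dim) clean latents; rows for missing tracks are ignored.
    known_mask:    (n_tracks, 1), 1.0 where the track is provided.
    """
    x_t = torch.randn(n_tracks, latent_dim)
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x_t, t)
        ab, a, b = alpha_bars[t], alphas[t], betas[t]
        # Standard DDPM posterior mean for the tracks being generated.
        mean = (x_t - b / (1 - ab).sqrt() * eps_hat) / a.sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_generated = mean + b.sqrt() * noise
        # Keep the provided tracks on their own diffusion trajectory.
        x_known = q_sample(known_latents, t, torch.randn_like(x_t)) if t > 0 else known_latents
        x_t = known_mask * x_known + (1 - known_mask) * x_generated
    return x_t

# Example: generate the remaining tracks given bass and drum stems (tracks 0 and 1).
given = torch.randn(n_tracks, latent_dim)          # pretend these are encoded stems
mask = torch.tensor([[1.0], [1.0], [0.0], [0.0]])  # 1 = track is provided
result = arrangement_sample(given, mask)
print(result.shape)  # torch.Size([4, 128])
```

In this toy example, the mask marks the bass and drum latents as given, and the sampling loop fills in the remaining tracks jointly so that, with a trained denoiser, they would remain coherent with the provided material.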

Main Content

This item is under embargo until June 24, 2025.