In recent years, diffusion models have demonstrated promising results in cross-modal generation tasks within generative media, encompassing image, video, and audio generation.
This development has introduced a great deal of novelty to audio and music-related tasks, such
as text-to-sound and text-to-music generation. However, these text-controlled music generation
models typically focus on capturing global musical attributes, such as genre and mood, and do
not allow for the more fine-grained control that composers might desire. Music composition is a
complex, multilayered task that frequently involves intricate musical arrangements as an essential
part of the creative process. This task requires composers to carefully align each instrument with
existing tracks in terms of beat, dynamics, harmony, and melody, demanding a level of precision
and control over individual tracks that current text-driven prompts often fail to provide.
In this work, we address these challenges by presenting a multi-track music generation model, one of the first of its kind. Our model, by learning the joint probability of tracks sharing
a context, is capable of generating music across several tracks that correspond well to each other,
either conditionally or unconditionally. We achieve this by extending MusicLDM, a latent
diffusion model for music, into a multi-track generative model. Additionally, our model is
capable of arrangement generation, where it can generate any subset of tracks given the others
(e.g., generating a piano track that complements given bass and drum tracks). We compared our
model with existing multi-track generative models and demonstrated that our model achieves
considerable improvements across objective metrics, for both total and arrangement generation
tasks. Additionally, we demonstrated that our model is capable of meaningful conditional
generation with text and reference musical audio, corresponding well to text meaning and
reference audio content/style. Sound examples from this work can be found at
https://mtmusicldm.github.io.