Controllable Monophonic Music Generation via Latent Variable Disentanglement
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

  • Author(s): Chen, Ke
  • Advisor(s): Dubnov, Shlomo
  • et al.

Automatic music generation is an attractive topic at the intersection of music and computer science, and the advent of deep learning techniques has brought new methodologies to it. Studying this topic helps us understand how computers process musical elements, from notes and beats to melodies, structures, and dynamics. If we can subsequently extract the creative mechanisms that machines learn, this in turn helps humans better understand music. Within the generation problem, how humans can interact with the computer is an interesting question. Drawing an analogy with automatic image completion systems, we propose Music SketchNet, a neural network framework that allows users to specify partial musical ideas to guide monophonic music generation. We focus on generating the missing measures in incomplete monophonic musical pieces, conditioned on the surrounding context and optionally guided by user-specified pitch and rhythm snippets. First, we introduce SketchVAE, a novel variational autoencoder that explicitly factorizes rhythm and pitch contour and forms the basis of our proposed model. Then we introduce two discriminative architectures, SketchInpainter and SketchConnector, which together perform the guided music completion, filling in representations for the missing measures conditioned on the surrounding context and user-specified snippets. In our experiments, we first evaluate SketchVAE on three standard datasets from different genres: folk, classical, and pop songs. We then evaluate the whole SketchNet on a standard dataset of Irish folk music and compare it with models from recent works. When used for music completion, our approach outperforms the state of the art in both objective metrics and subjective listening tests. Finally, we demonstrate that our model can successfully incorporate user-specified snippets during the generation process.
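
To make the idea of factorizing rhythm and pitch contour concrete, the following is a minimal data-level sketch, not the SketchVAE model itself. It assumes a hypothetical per-time-step token encoding in which values 0 to 127 are MIDI pitches and two extra tokens (HOLD = 128, REST = 129) mark sustain and silence; the token values and the function names factorize and recombine are illustrative assumptions rather than details stated in the abstract.

```python
from typing import List, Tuple

# Assumed token encoding (hypothetical, for illustration only):
# 0..127 are MIDI pitches sounding a new note at that time step.
HOLD = 128  # sustain the previously sounded note
REST = 129  # silence


def factorize(measure: List[int]) -> Tuple[List[int], List[str]]:
    """Split one measure into a pitch contour and a rhythm pattern.

    The pitch contour keeps only the note pitches, in order.
    The rhythm pattern keeps only what happens at each time step.
    """
    pitches = [t for t in measure if t < HOLD]
    rhythm = []
    for t in measure:
        if t == HOLD:
            rhythm.append("hold")
        elif t == REST:
            rhythm.append("rest")
        else:
            rhythm.append("onset")
    return pitches, rhythm


def recombine(pitches: List[int], rhythm: List[str]) -> List[int]:
    """Rebuild a measure by placing pitches onto the onsets of a rhythm."""
    remaining = iter(pitches)
    measure = []
    for step in rhythm:
        if step == "onset":
            measure.append(next(remaining))
        elif step == "hold":
            measure.append(HOLD)
        else:
            measure.append(REST)
    return measure


# Example: keep the rhythm of one measure, swap in a different pitch contour.
original = [60, HOLD, 62, REST, 64, HOLD, HOLD, REST]
pitches, rhythm = factorize(original)
restored = recombine(pitches, rhythm)
sketched = recombine([67, 65, 64], rhythm)
```

Under these assumptions, a user-specified rhythm or pitch snippet corresponds to fixing one of the two factors while the other is left for the model to fill in; in the thesis this separation is learned in a latent space rather than applied directly to tokens as above.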
