Multi-Source Music Generation with Latent Diffusion

📅 2024-09-10
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Existing multi-source diffusion models (MSDMs) for music generation suffer from impoverished melodic content, frequent silent segments, and Gaussian noise artifacts. To address these issues, we propose a joint latent-space diffusion framework for controllable multi-instrument music generation. Our method introduces a source-specific variational autoencoder (VAE) that learns disentangled “source latent spaces” for piano, drums, bass, and other instruments. Within a unified joint latent space, we simultaneously model and diffuse all source signals before mixing them into the final audio. The framework integrates latent-variable concatenation with collaborative sampling strategies and employs the Fréchet Audio Distance (FAD) for objective evaluation. Experiments demonstrate substantial improvements in melodic richness and source completeness, complete elimination of silent segments and Gaussian artifacts, superior performance over MSDMs in both subjective listening tests and FAD scores, and higher generation efficiency.

Technology Category

Application Category

📝 Abstract
Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g. piano, drums, bass, and guitar). Its goal is to use one single diffusion model to generate mutually-coherent music sources, that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a Multi-Source Latent Diffusion Model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a"source latent."The source latents are concatenated and our diffusion model learns this joint latent space. This approach significantly enhances the total and partial generation of music by leveraging the VAE's latent compression and noise-robustness. The compressed source latent also facilitates more efficient generation. Subjective listening tests and Frechet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Codes and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
Problem

Research questions and friction points this paper is trying to address.

Generates multi-source music with coherent instrumental sources
Reduces Gaussian noise artifacts in waveform diffusion
Improves melody richness and prevents empty sound generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Variational Autoencoders for latent encoding
Trains diffusion model on joint latent space
Enhances generation via latent compression
🔎 Similar Papers
No similar papers found.