Subtractive Training for Music Stem Insertion using Latent Diffusion Models

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses precise, controllable synthesis of missing instrument parts in polyphonic music. The proposed method introduces a cross-modal latent-diffusion framework trained via subtractive pairing: training data pair full mixes with single-instrument-removed variants, conditioned on LLM-generated text instructions that describe the rhythm, dynamics, and style of the missing part. Multi-condition cross-attention guides joint audio-MIDI modeling in the latent space to reconstruct that part. Key contributions include: (i) the first subtractive training paradigm for instrument-part completion; (ii) fine-grained, text-driven re-synthesis of individual parts; and (iii) dual-modality (audio plus MIDI) conditional generation for single-instrument reconstruction. Experiments demonstrate high-fidelity, time-frequency-coherent synthesis (e.g., of drum stems) that leaves the remaining parts intact, enabling flexible style transfer and extending to bass, guitar, and other multi-track instrument generation.

📝 Abstract
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
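The core data construction the abstract describes, pairing a full mix with a variant lacking one stem, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function name, the use of NumPy arrays as stand-in waveforms, and the toy stem names are all assumptions, and real stems would be time-aligned audio of equal length.

```python
# Hypothetical sketch of building a "subtractive" training pair, assuming each
# song is available as per-instrument stems of equal length (names illustrative).
import numpy as np

def make_subtractive_pair(stems, target):
    """Given {instrument: waveform}, return (full_mix, mix_without_target, target_stem).

    The diffusion model would be fine-tuned to generate target_stem given
    mix_without_target (plus a text instruction) as conditioning context.
    """
    full_mix = sum(stems.values())
    context_mix = sum(w for name, w in stems.items() if name != target)
    return full_mix, context_mix, stems[target]

# Toy example with random signals in place of real audio.
rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(8) for name in ("drums", "bass", "guitar")}
mix, context, drum_stem = make_subtractive_pair(stems, "drums")

# By construction, the removed stem is exactly the mix-minus-context residual.
assert np.allclose(mix - context, drum_stem)
```

Because mixing is additive, the target stem is recoverable as the difference between the two paired mixes, which is what makes the pairing "subtractive": no manual annotation of the missing part is needed beyond the stem split itself.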
Problem

Research questions and friction points this paper is trying to address.

Polyphonic Music Processing
Instrument Voice Modification
Audio Harmony Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subtractive Training
Music Modification
MIDI Application
Ivan Villa-Renteria
Stanford University
Mason L. Wang
Stanford University
Zachary Shah
Stanford University
Zhe Li
Hong Kong Polytechnic University
Soohyun Kim
Korea University
Deep Learning · Computer Vision · Generative Models
Neelesh Ramachandran
Stanford University
Mert Pilanci
Stanford University
Machine Learning · Optimization · Neural Networks · Signal Processing · Information Theory