Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses singing-voice separation in real-world music recordings with a generative diffusion model trained to generate the solo vocals conditioned on the mixture audio. Whereas the task is conventionally tackled by networks that mask or transform the time-frequency representation of the mixture, the diffusion approach lets the user adjust the number of denoising steps and the noise schedule at sampling time, giving a controllable trade-off between separation quality and inference efficiency and the ability to refine the output when needed. The method improves on prior generative systems and, when trained with supplementary data, achieves objective scores competitive with non-generative baselines. An ablation study of the sampling algorithm highlights the effect of the user-configurable parameters on separation quality.

📝 Abstract
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
Problem

Research questions and friction points this paper is trying to address.

Separating singing vocals from music mixtures
Using diffusion models for source separation tasks
Improving quality-efficiency trade-off in vocal extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model generates the separated singing vocals
Generation conditioned on the corresponding music mixture
Iterative sampling gives the user control over the quality-efficiency trade-off
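The quality-efficiency control described above comes from the iterative nature of diffusion sampling: fewer denoising steps are faster but coarser, more steps are slower but finer. A minimal sketch of this idea is below; the paper does not publish its sampler, so the linear noise schedule, the deterministic DDIM-style update, and the toy `denoiser` stand-in for the trained conditional network are all assumptions for illustration.

```python
import numpy as np

def make_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule; num_steps is the user-facing quality/speed knob."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)  # cumulative alpha-bar values

def denoiser(x_t, mixture, t):
    """Hypothetical stand-in for a trained network eps_theta(x_t, mixture, t)
    that predicts the noise in x_t given the conditioning mixture."""
    return x_t - mixture  # toy residual estimate, NOT the paper's model

def sample_vocals(mixture, num_steps=50, seed=0):
    """Deterministic DDIM-style sampling of vocals conditioned on the mixture."""
    rng = np.random.default_rng(seed)
    alphas = make_schedule(num_steps)
    x = rng.standard_normal(mixture.shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        a_t = alphas[t]
        eps = denoiser(x, mixture, t)
        # Predict the clean signal, then re-project to the previous noise level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        a_prev = alphas[t - 1] if t > 0 else 1.0
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x
```

Calling `sample_vocals(mix, num_steps=4)` versus `num_steps=100` with the same trained denoiser is exactly the trade-off the paper exposes: the loop body is unchanged, only the discretization of the reverse process gets coarser or finer.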