🤖 AI Summary
Existing supervised audio source separation methods rely heavily on large-scale annotated datasets, generalize poorly, and adapt badly to open-set acoustic scenarios. To address these limitations, this paper proposes the first zero-shot, training-free separation framework. Methodologically, it leverages a pre-trained text-to-audio diffusion model: the audio mixture is inverted into the model's latent space and then denoised under text conditioning, enabling on-demand separation of arbitrary sound sources without fine-tuning or task-specific data. The core contribution is the adaptation of a generative diffusion model to a discriminative separation task, replacing conventional supervision with textual priors through a text-conditioned sampling mechanism. Experiments show that the method outperforms supervised baselines on multiple benchmarks, supports open-vocabulary sound descriptions, and improves robustness and adaptability in realistic acoustic environments.
📝 Abstract
Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong performance on multiple separation benchmarks, surpassing even supervised methods.
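The invert-then-denoise pipeline described above can be sketched in miniature. The snippet below is a toy illustration, not the paper's implementation: `eps_model` is a hypothetical stand-in for a pre-trained text-conditioned noise predictor (it maps a prompt to a fixed noise direction so the deterministic DDIM round trip can be verified), and the noise schedule, step count, and prompts are all made-up assumptions. The real method would operate on latents from an actual text-to-audio diffusion backbone.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained text-conditioned noise predictor.
# It returns a prompt-dependent (but input-independent) noise direction,
# which keeps deterministic DDIM inversion exactly invertible in this toy.
def eps_model(x, t, prompt):
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(x.shape)

T = 50
alphas_bar = np.linspace(0.999, 0.01, T)  # toy noise schedule (assumption)

def ddim_invert(x0, prompt):
    """Map a (mixture) latent x0 to a noisy latent xT via deterministic DDIM."""
    x = x0
    for t in range(T - 1):
        eps = eps_model(x, t, prompt)
        x0_pred = (x - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
        x = np.sqrt(alphas_bar[t + 1]) * x0_pred + np.sqrt(1 - alphas_bar[t + 1]) * eps
    return x

def ddim_sample(xT, prompt):
    """Denoise xT back to a clean latent, guided by a text prompt."""
    x = xT
    for t in range(T - 1, 0, -1):
        eps = eps_model(x, t, prompt)
        x0_pred = (x - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
        x = np.sqrt(alphas_bar[t - 1]) * x0_pred + np.sqrt(1 - alphas_bar[t - 1]) * eps
    return x

# Separation, in spirit: invert the mixture latent, then denoise with a
# prompt naming the desired source (prompts here are illustrative only).
mixture_latent = np.random.default_rng(0).standard_normal(16)
xT = ddim_invert(mixture_latent, "a dog barking over street noise")
source_latent = ddim_sample(xT, "a dog barking")
```

With a deterministic (eta = 0) DDIM sampler, inverting and then denoising under the *same* condition reconstructs the input latent, which is the property the text prompt then perturbs: swapping in a source-specific prompt steers the denoising trajectory toward that source instead of the full mixture.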