🤖 AI Summary
This work addresses target sound extraction (TSE) in complex acoustic scenes. We propose a language-guided diffusion Transformer that denoises in latent space, conditioned on target-sound embeddings from a CLAP model. Because CLAP embeds text and audio in a shared space, a textual description or an audio prompt can serve interchangeably as the extraction cue, enabling fine-grained, semantics-driven separation. The architecture uses a skip-connected Transformer backbone, and training is augmented with synthetic audio from text-to-audio models, yielding robust zero-shot and few-shot extraction of unseen sound events. Evaluated on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, our method achieves state-of-the-art results, with notable gains in out-of-domain generalization and few-shot robustness, pointing toward open-vocabulary sound separation.
📝 Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains a latent diffusion model on audio, replacing the U-Net backbone used in prior work with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
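To make the "skip-connected Transformer" idea concrete, here is a minimal pure-Python sketch of the wiring pattern the abstract describes: the first half of the blocks stash their outputs on a stack, and each block in the second half fuses the popped activation with its input before processing, U-Net-style. `block` and `fuse` are hypothetical stand-ins for the learned Transformer block and skip-fusion layer, not SoloAudio's actual implementation.

```python
def block(x, scale):
    # Stand-in for a Transformer block; a learned transform would go here.
    return [v * scale for v in x]

def fuse(x, skip):
    # Stand-in for the learned skip fusion (e.g. concat + linear projection).
    return [(a + b) / 2 for a, b in zip(x, skip)]

def skip_connected_transformer(x, depth=4):
    """Apply `depth` blocks with long skips from the first half to the second."""
    assert depth % 2 == 0
    stack = []
    for i in range(depth // 2):      # first half: run a block, push its output
        x = block(x, scale=1.0 + 0.1 * i)
        stack.append(x)
    for _ in range(depth // 2):      # second half: pop a skip, fuse, run a block
        x = fuse(x, stack.pop())
        x = block(x, scale=1.0)
    return x

out = skip_connected_transformer([1.0, 2.0, 3.0])
print(len(out))  # the output keeps the input sequence length
```

The long skips let late blocks see early, less-processed features directly, which is the property the U-Net backbone provided and the Transformer variant retains.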