SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

📅 2024-09-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses target sound extraction in complex acoustic scenes. We propose a language-guided audio diffusion Transformer that jointly leverages cross-modal alignment via CLAP and latent-space diffusion. Our architecture features a skip-connected Transformer backbone and incorporates text-to-audio synthesis for data augmentation, enabling robust zero-shot and few-shot separation. Crucially, we unify textual descriptions and audio prompts into a single semantic representation, enhancing fine-grained, semantics-driven separation. Evaluated on FSD Kaggle 2018 and AudioSet benchmarks, our method achieves state-of-the-art performance, with significant improvements in out-of-distribution generalization and few-shot robustness. This establishes a novel paradigm for open-vocabulary sound separation.
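The key idea of unifying textual descriptions and audio prompts is that CLAP's text and audio encoders project into one shared embedding space, so the diffusion model can be conditioned on either modality interchangeably. A minimal illustrative sketch of this property (stand-in random projections, not CLAP's actual encoders) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # shared embedding dimension (illustrative; real CLAP uses 512)

def normalize(v):
    # CLAP embeddings are L2-normalized so similarity is cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical stand-ins for CLAP's two encoders: both project their
# modality-specific features into the SAME d-dimensional space.
W_text = rng.standard_normal((32, d)) * 0.1   # text-feature projection
W_audio = rng.standard_normal((64, d)) * 0.1  # audio-feature projection

def embed_text(text_feats):
    return normalize(text_feats @ W_text)

def embed_audio(audio_feats):
    return normalize(audio_feats @ W_audio)

text_cond = embed_text(rng.standard_normal(32))
audio_cond = embed_audio(rng.standard_normal(64))

# Both conditioning vectors have identical shape and norm, so the
# downstream diffusion model is modality-agnostic.
print(text_cond.shape, audio_cond.shape)  # (16,) (16,)
```

Because both prompts arrive as the same kind of unit vector, the separation model needs no modality-specific conditioning path, which is what enables language-oriented and audio-oriented TSE in a single architecture.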

📝 Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
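The skip-connected Transformer backbone replaces the U-Net while keeping its signature long skip connections: activations from early blocks are concatenated into the matching late blocks (U-ViT style). A toy NumPy sketch of that wiring (residual linear layers standing in for full Transformer blocks; all shapes and weights are illustrative, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8      # latent feature dimension (illustrative)
depth = 4  # total blocks; the first half caches skip activations

def block(x, w):
    # Stand-in for a Transformer block: a simple residual nonlinearity
    return x + np.tanh(x @ w)

weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(depth)]
# Fusion projections that fold a concatenated skip back to width d
fuse = [rng.standard_normal((2 * d, d)) * 0.1 for _ in range(depth // 2)]

def skip_connected_stack(x):
    skips = []
    for i in range(depth // 2):           # "encoder" half: save activations
        x = block(x, weights[i])
        skips.append(x)
    for i in range(depth // 2, depth):    # "decoder" half: concat long skip
        s = skips.pop()                   # pair first block with last, etc.
        x = np.concatenate([x, s], axis=-1) @ fuse[i - depth // 2]
        x = block(x, weights[i])
    return x

latent = rng.standard_normal((16, d))     # 16 latent frames
out = skip_connected_stack(latent)
print(out.shape)  # (16, 8)
```

The long skips let fine-grained detail from shallow blocks bypass the bottleneck of the stack, which is the same motivation as in the U-Net the paper replaces.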
Problem

Research questions and friction points this paper is trying to address.

Audio Separation
Noisy Environment
Sound Extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

SoloAudio
CLAP
text-to-audio
Helin Wang
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

Jiarui Hai
Johns Hopkins University
computer audition, generative models, music information retrieval

Yen-Ju Lu
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

Karan Thakkar
Johns Hopkins University
auditory perception, deep learning, generative modeling, brain decoding

Mounya Elhilali
Professor of Electrical and Computer Engineering, Johns Hopkins University

N. Dehak
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA