🤖 AI Summary
This work addresses target sound extraction (TSE) in complex acoustic scenes. We propose a language-guided diffusion Transformer that denoises in latent space, conditioned on target-sound embeddings from a CLAP model. Because CLAP embeds text and audio in a shared space, a textual description or an audio prompt can serve interchangeably as the extraction cue, enabling fine-grained, semantics-driven separation. The architecture uses a skip-connected Transformer backbone, and training is augmented with synthetic audio from text-to-audio models, yielding robust zero-shot and few-shot extraction of unseen sound events. Evaluated on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, our method achieves state-of-the-art results, with notable gains in out-of-domain generalization and few-shot robustness, pointing toward open-vocabulary sound separation.
📝 Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains a latent diffusion model on audio, replacing the U-Net backbone used in prior work with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
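To make the "skip-connected Transformer" idea concrete, here is a minimal pure-Python sketch of the wiring pattern the abstract describes: the first half of the blocks stash their outputs on a stack, and each block in the second half fuses the popped activation with its input before processing, U-Net-style. `block` and `fuse` are hypothetical stand-ins for the learned Transformer block and skip-fusion layer, not SoloAudio's actual implementation.

```python
def block(x, scale):
    # Stand-in for a Transformer block; a learned transform would go here.
    return [v * scale for v in x]

def fuse(x, skip):
    # Stand-in for the learned skip fusion (e.g. concat + linear projection).
    return [(a + b) / 2 for a, b in zip(x, skip)]

def skip_connected_transformer(x, depth=4):
    """Apply `depth` blocks with long skips from the first half to the second."""
    assert depth % 2 == 0
    stack = []
    for i in range(depth // 2):      # first half: run a block, push its output
        x = block(x, scale=1.0 + 0.1 * i)
        stack.append(x)
    for _ in range(depth // 2):      # second half: pop a skip, fuse, run a block
        x = fuse(x, stack.pop())
        x = block(x, scale=1.0)
    return x

out = skip_connected_transformer([1.0, 2.0, 3.0])
print(len(out))  # the output keeps the input sequence length
```

The long skips let late blocks see early, less-processed features directly, which is the property the U-Net backbone provided and the Transformer variant retains.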