Exploring Text-Queried Sound Event Detection with Audio Source Separation

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address performance degradation in sound event detection (SED) caused by overlapping acoustic events and background noise, this paper proposes a text-queried SED framework (TQ-SED). First, a pre-trained language-queried audio source separation (LASS) model separates the audio tracks corresponding to different events from the input mixture; to overcome the limited dynamic audio modeling of AudioSep's purely convolutional separation structure, the authors integrate a dual-path RNN block into it, yielding AudioSep-DP. Second, multiple target SED branches independently detect events from the separated tracks. AudioSep-DP achieves first place in DCASE 2024 Task 9 on language-queried audio source separation (objective single-model track), and the full TQ-SED framework improves the F1 score by 7.22% over the conventional SED framework. The code and pre-trained models are publicly available.

📝 Abstract
In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks corresponding to different events from the input audio. Then, multiple target SED branches are employed to detect individual events. AudioSep is a state-of-the-art LASS model, but it has limitations in extracting dynamic audio information because of its purely convolutional separation structure. To address this, we integrate a dual-path recurrent neural network block into the model. We refer to this structure as AudioSep-DP, which achieved first place in DCASE 2024 Task 9 on language-queried audio source separation (objective single-model track). Experimental results show that TQ-SED can significantly improve SED performance, with an improvement of 7.22% in F1 score over the conventional framework. Additionally, we set up comprehensive experiments to explore the impact of model complexity. The source code and pre-trained model are released at https://github.com/apple-yinhan/TQ-SED.
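The two-stage pipeline in the abstract (LASS separation per text query, then one SED branch per separated track) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `separate` and `detect` are hypothetical stand-ins for the AudioSep-DP separator and the target SED branches, operating on toy labeled samples instead of real audio.

```python
# Hypothetical sketch of the TQ-SED framework: separate one track per
# text query, then run a dedicated detection branch on each track.
# `separate` and `detect` are toy stand-ins, not the paper's models.

def separate(mixture, query):
    """Stand-in for the LASS model (AudioSep-DP in the paper):
    return the track corresponding to the text query."""
    # Toy behavior: keep samples tagged with the query label.
    return [sample for sample, label in mixture if label == query]

def detect(track):
    """Stand-in for one target SED branch: frame-level activity."""
    return [1 if abs(sample) > 0.1 else 0 for sample in track]

def tq_sed(mixture, queries):
    # One separation + detection branch per queried event class.
    return {q: detect(separate(mixture, q)) for q in queries}

mixture = [(0.5, "dog_bark"), (0.05, "dog_bark"), (0.9, "siren")]
print(tq_sed(mixture, ["dog_bark", "siren"]))
# {'dog_bark': [1, 0], 'siren': [1]}
```

The point of the structure is that each detection branch sees a cleaner, single-event track rather than the full mixture, which is what the paper credits for the F1 improvement over detecting directly on the mixture.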
Problem

Research questions and friction points this paper is trying to address.

Sound Event Separation
Noise Reduction
Audio Event Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

TQ-SED
AudioSep-DP
Dual-Path RNN
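The dual-path idea that AudioSep-DP adds to AudioSep can be sketched in miniature: a long sequence is folded into chunks, processed along the intra-chunk axis for local structure, then along the inter-chunk axis for long-range structure. This is an illustrative sketch only; `smooth` is a hypothetical stand-in for a real intra/inter-chunk RNN pass.

```python
# Minimal sketch of dual-path sequence processing: fold a sequence into
# chunks, process within each chunk, then across chunks at each position.
# `smooth` (a running average) stands in for a learned RNN pass.

def chunk(seq, size):
    # Fold the sequence into consecutive chunks of `size` elements.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def smooth(xs):
    """Stand-in for an RNN pass: running average over the sequence."""
    out, acc = [], 0.0
    for i, x in enumerate(xs, 1):
        acc += x
        out.append(acc / i)
    return out

def dual_path(seq, size):
    chunks = chunk(seq, size)              # shape: [num_chunks][size]
    intra = [smooth(c) for c in chunks]    # intra-chunk pass (local)
    # Transpose so each row holds the same position across all chunks.
    cols = list(map(list, zip(*intra)))
    inter = [smooth(col) for col in cols]  # inter-chunk pass (global)
    # Transpose back and flatten to the original sequence layout.
    rows = list(map(list, zip(*inter)))
    return [x for row in rows for x in row]

print(dual_path([1.0, 2.0, 3.0, 4.0], 2))
# [1.0, 1.5, 2.0, 2.5]
```

Because each pass runs over sequences of roughly the square root of the original length, the dual-path layout keeps RNN paths short while still propagating information across the whole input, which is why it helps the separator capture dynamic audio information that a purely convolutional stack misses.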