Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sound event detection (SED) methods operate under a closed-set assumption, limiting generalization to unseen classes. This work proposes an open-vocabulary SED framework that formulates detection as a frame-level cross-modal retrieval task, enabling zero-shot recognition and localization conditioned on text or audio prompts. Key contributions include: (1) a dual-stream decoder architecture that decouples event classification from temporal localization; (2) an inference-time attention masking strategy that exploits semantic relations between base and novel classes to improve generalization to novel categories; and (3) a CLAP-based multimodal feature extractor coupled with a cross-modal event decoder that fuses query and audio-feature representations and models temporal dependencies. Experiments demonstrate state-of-the-art performance: the method surpasses CLAP-based methods by 7.8 PSDS in the open-vocabulary setting on AudioSet Strong (and the baseline by 6.9 PSDS in closed-set evaluation) and achieves a PSDS1 of 42.2 in zero-shot transfer to DESED, exceeding even the supervised CRNN baseline.
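The core formulation above, SED as frame-level cross-modal retrieval, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-frame audio embeddings and query embeddings (from text or audio prompts) already projected into a shared space, and scores each frame against each query by cosine similarity; the temperature and threshold values are hypothetical.

```python
import numpy as np

def frame_level_retrieval(frame_embeds, query_embeds, temperature=5.0, threshold=0.5):
    """Score each audio frame against each multi-modal query embedding.

    frame_embeds: (T, D) per-frame audio features in the shared space.
    query_embeds: (C, D) query vectors derived from text or audio prompts.
    Returns a (T, C) boolean detection mask: frame t contains event c.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    f = frame_embeds / np.linalg.norm(frame_embeds, axis=-1, keepdims=True)
    q = query_embeds / np.linalg.norm(query_embeds, axis=-1, keepdims=True)
    sim = f @ q.T                                  # (T, C) cosine similarities
    probs = 1.0 / (1.0 + np.exp(-temperature * sim))  # squash to (0, 1)
    return probs > threshold
```

Because queries are just embedding vectors, novel classes at inference time only require a new prompt, never retraining, which is what makes the formulation open-vocabulary.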

📝 Abstract
Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
Problem

Research questions and friction points this paper is trying to address.

Enhancing open-vocabulary sound event detection via multi-modal queries
Improving fine-grained alignment and cross-modal fusion in audio-language models
Balancing localization accuracy and generalization to novel sound classes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal query framework for open-vocabulary SED
Dual-stream decoder decouples recognition and localization
Inference-time attention masking enhances novel class generalization
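The third bullet, inference-time attention masking, can be illustrated with a hypothetical sketch. The paper only states that the mask leverages semantic relations between base and novel classes; the concrete rule below (a novel-class query may attend only to itself and to base-class queries whose embedding similarity exceeds a threshold) is an assumption for illustration, as are the function name and threshold value.

```python
import numpy as np

def build_query_attention_mask(query_embeds, is_novel, sim_threshold=0.4):
    """Hypothetical mask over query self-attention at inference time.

    query_embeds: (C, D) query embeddings; is_novel: length-C booleans.
    Novel-class queries may attend only to themselves and to base-class
    queries with cosine similarity above sim_threshold; base-class queries
    attend freely. Returns a (C, C) bool mask, True = attention allowed.
    """
    q = query_embeds / np.linalg.norm(query_embeds, axis=-1, keepdims=True)
    sim = q @ q.T                         # (C, C) pairwise cosine similarity
    novel = np.asarray(is_novel, dtype=bool)
    mask = np.ones((len(novel), len(novel)), dtype=bool)
    for i in np.flatnonzero(novel):
        allowed = (~novel) & (sim[i] > sim_threshold)  # related base classes
        allowed[i] = True                              # always attend to self
        mask[i] = allowed
    return mask
```

The intuition is that a novel class (say, a new siren type) borrows context from semantically close base classes while being shielded from unrelated ones, which is one plausible way such a mask could improve generalization without any retraining.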
Pengfei Cai
University of Science and Technology of China
Yan Song
University of Science and Technology of China
Qing Gu
Nanjing University
Nan Jiang
University of Science and Technology of China
Haoyu Song
Singapore Institute of Technology
Ian McLoughlin
Professor, Singapore Institute of Technology (Singapore) and USTC (China)
AI for speech & audio, signal processing, embedded systems, computer architecture