FlexSED: Towards Open-Vocabulary Sound Event Detection

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing sound event detection (SED) methods are constrained by the closed-vocabulary assumption, limiting support for free-text queries and exhibiting poor zero-shot and few-shot generalization. Moreover, existing text-guided source separation techniques are ill-suited for SED tasks requiring fine-grained temporal localization and open-vocabulary retrieval. To address these limitations, we propose Open-SED—a novel framework for open-vocabulary SED. It integrates an audio self-supervised encoder (e.g., Data2Vec) with the CLAP text encoder, employs an adaptive cross-modal fusion decoder, and leverages large language models to generate high-quality, diverse event queries for enhanced supervision. Through continuous pretraining and collaborative annotation, Open-SED achieves substantial improvements on AudioSet-Strong: it outperforms conventional SED methods by +12.3% mAP in zero-shot and +9.7% mAP in 5-shot settings. The code and pretrained models are publicly released.

Technology Category

Application Category

📝 Abstract
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
Problem

Research questions and friction points this paper is trying to address.

Detects sounds from free-text queries
Enables zero-shot and few-shot sound detection
Improves temporal localization for diverse sound vocabularies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines pretrained audio SSL model with CLAP text encoder
Uses encoder-decoder composition with adaptive fusion strategy
Employs LLMs for event query selection during training
🔎 Similar Papers
No similar papers found.
Jiarui Hai
Jiarui Hai
Johns Hopkins University
computer auditiongenerative modelsmusic information retrieval
H
Helin Wang
Department of Electrical and Computer Engineering, Johns Hopkins University, Maryland, USA
W
Weizhe Guo
Department of Electrical and Computer Engineering, Johns Hopkins University, Maryland, USA
Mounya Elhilali
Mounya Elhilali
Professor of electrical and computer engineering, the johns hopkins university