🤖 AI Summary
Existing sound event detection (SED) methods are constrained by a closed-vocabulary assumption: they cannot handle free-text queries and generalize poorly in zero-shot and few-shot settings. Text-guided source separation techniques are likewise ill-suited for SED, which demands fine-grained temporal localization and efficient detection over large, diverse sound vocabularies. To address these limitations, the authors propose FlexSED, an open-vocabulary SED framework. It couples a pretrained audio self-supervised (SSL) encoder with the CLAP text encoder through an encoder-decoder composition and an adaptive fusion strategy, enabling effective continued training from pretrained weights, and it employs large language models (LLMs) to assist event-query selection during training, mitigating the effect of missing labels. FlexSED outperforms vanilla SED models on AudioSet-Strong, with reported gains of +12.3% mAP in zero-shot and +9.7% mAP in 5-shot settings, and the code and pretrained models are publicly released.
📝 Abstract
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited: they cannot process free-text sound queries, which would enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and adapt poorly in few-shot settings. Although text-query-based separation methods have been explored, they primarily target source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large, diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy that enable effective continued training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event-query selection during training, addressing challenges posed by missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
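The core query-by-text idea behind open-vocabulary SED can be illustrated in a few lines: a text encoder (CLAP in the paper) maps a free-text event query to an embedding, which is compared against frame-level audio embeddings to produce a per-frame activation. The NumPy sketch below uses simple cosine similarity with a temperature-scaled sigmoid as a stand-in for the paper's adaptive fusion decoder; the function name, temperature, and threshold are all illustrative assumptions, not FlexSED's actual implementation.

```python
import numpy as np

def detect_event(frame_emb, query_emb, temperature=0.07, threshold=0.5):
    """Toy open-vocabulary detection: per-frame similarity to a text query.

    frame_emb: (T, D) frame-level audio embeddings (e.g., from an SSL encoder)
    query_emb: (D,) text embedding of a free-text event query (e.g., from CLAP)
    Returns a boolean activity mask of shape (T,).
    """
    # L2-normalize so the dot product is cosine similarity in [-1, 1]
    a = frame_emb / (np.linalg.norm(frame_emb, axis=1, keepdims=True) + 1e-8)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    sim = a @ q                                   # (T,) similarity per frame
    prob = 1.0 / (1.0 + np.exp(-sim / temperature))  # squash to (0, 1)
    return prob > threshold                       # frames where the event is active
```

In a real system the similarity step is replaced by learned cross-modal fusion, but this captures why detection generalizes to unseen classes: any text that CLAP can embed becomes a valid query, with no fixed class list.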