🤖 AI Summary
This work addresses the limitations of conventional frame-level sound event detection models, which rely on post-processing to recover events and therefore suffer from ambiguous temporal boundaries. To overcome these issues, the authors propose an end-to-end boundary-aware modeling approach that explicitly captures event onsets and offsets through a Recurrent Event Detection (RED) layer and an Event Proposal Network (EPN). Tailored loss functions enable boundary-sensitive optimization and inference. The method requires no post-processing or post-processing hyperparameter tuning, and it scales to achieve state-of-the-art performance across all AudioSet Strong classes, significantly outperforming existing frame-level models paired with post-processing.
📝 Abstract
Temporal detection problems appear in many fields, including time-series estimation, activity recognition, and sound event detection (SED). In this work, we propose a new approach to temporal event modeling that explicitly models event onsets and offsets, and we introduce boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain on a subset of AudioSet with temporally-strong annotations. Experimental results show that our approach not only outperforms traditional frame-wise SED models equipped with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.