Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
High-cost manual annotation of sound event temporal boundaries severely limits the scalability of fully supervised learning; existing weakly supervised methods rely predominantly on clip-level labels, while partial label learning remains unexplored in audio analysis. This paper introduces partial label learning to sound event detection (SED), proposing a framework that leverages semantic priors from acoustic scene classification to automatically generate partial labels, i.e., a set of candidate sound events for each clip rather than exact event annotations. The authors formulate a multitask learning architecture that jointly optimizes acoustic scene classification and SED. They further design a self-distillation-based label refinement mechanism to train on both fully labeled and partially labeled data. Experiments demonstrate substantial reductions in annotation cost and consistent performance improvements across multiple benchmarks, validating the effectiveness and scalability of partial label learning in realistic audio scenarios.
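The core idea of scene-derived partial labels can be sketched as follows. This is a minimal illustration, not the paper's implementation; the scene-to-event mapping and all scene/event names are hypothetical placeholders:

```python
# Hypothetical scene-to-event prior: each acoustic scene constrains which
# sound events can plausibly occur in it (illustrative names only).
SCENE_EVENTS = {
    "home": {"dishes", "vacuum_cleaner", "speech", "cat"},
    "street": {"car", "speech", "footsteps", "siren"},
}

def partial_labels(scene, all_events):
    """Return a multi-hot candidate vector over sorted(all_events):
    1.0 marks events that MAY occur given the scene (the partial label),
    not events confirmed to be present."""
    candidates = SCENE_EVENTS.get(scene, set(all_events))
    return [1.0 if e in candidates else 0.0 for e in sorted(all_events)]

events = ["car", "cat", "dishes", "footsteps", "siren", "speech", "vacuum_cleaner"]
print(partial_labels("home", events))  # prints [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

Only the clip's scene label is needed to produce these candidate sets, which is what makes the annotation cheap: no per-event or per-frame labeling is required.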

Technology Category

Application Category

πŸ“ Abstract
Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach in which a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. Although they reduce annotation costs, weakly supervised and partial label learning often suffer from decreased detection performance because the exact event set and its temporal annotations are unavailable. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
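One common way to turn such partial labels into a training signal, sketched below in NumPy, is to treat events outside the scene's candidate set as certain negatives and penalize only those classes, leaving candidate events unconstrained. This is an assumed, generic partial-label objective for illustration, not the paper's exact loss:

```python
import numpy as np

def partial_label_loss(frame_logits, candidate_mask):
    """Binary cross-entropy applied only to sure negatives (a sketch).
    frame_logits: (batch, frames, events) raw scores from the SED head.
    candidate_mask: (batch, events), 1.0 for events allowed by the scene."""
    probs = 1.0 / (1.0 + np.exp(-frame_logits))            # sigmoid
    neg = (1.0 - candidate_mask)[:, None, :]               # broadcast over frames
    neg = np.broadcast_to(neg, frame_logits.shape)
    bce = -np.log(1.0 - probs + 1e-7)                      # BCE with target = 0
    return (bce * neg).sum() / max(neg.sum(), 1.0)

logits = np.zeros((1, 2, 3))                 # 1 clip, 2 frames, 3 event classes
mask = np.array([[1.0, 0.0, 1.0]])           # class 1 not a candidate → sure negative
print(round(partial_label_loss(logits, mask), 4))  # prints 0.6931
```

In the semi-supervised setting the abstract describes, this term would be combined with a standard frame-level loss on the strongly labeled clips and a clip-level loss for acoustic scene classification.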
Problem

Research questions and friction points this paper is trying to address.

Reducing annotation costs for sound event detection using partial labels
Jointly analyzing acoustic scenes and sound events through multitask learning
Improving detection performance via semi-supervised training with partial labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses acoustic scenes to construct partial sound event labels
Proposes multitask learning with partial labels for joint analysis
Introduces self-distillation label refinement for partial labels
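The self-distillation refinement in the last point can be pictured with a small sketch: a teacher model's frame-level predictions, restricted to the scene's candidate events, promote confident candidates to soft positive targets while everything else stays a hard negative. The thresholding rule here is an illustrative assumption, not the paper's exact refinement procedure:

```python
import numpy as np

def refine_partial_labels(teacher_probs, candidate_mask, threshold=0.5):
    """Refine partial labels with a teacher's predictions (a sketch).
    teacher_probs: (frames, events) sigmoid outputs of a teacher model.
    candidate_mask: (events,), 1.0 for events allowed by the scene.
    Confident candidate predictions become soft targets; the rest become 0."""
    masked = teacher_probs * candidate_mask[None, :]   # zero out non-candidates
    return np.where(masked >= threshold, masked, 0.0)

probs = np.array([[0.9, 0.8, 0.2],
                  [0.1, 0.7, 0.6]])
mask = np.array([1.0, 0.0, 1.0])   # middle event excluded by the scene prior
print(refine_partial_labels(probs, mask))
```

The refined targets can then supervise the student (here, the same network in a self-distillation loop), progressively disambiguating which candidate events are actually present.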