Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels

📅 2025-10-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
High-cost manual annotation of sound event temporal boundaries severely limits the scalability of fully supervised learning; existing weakly supervised methods rely predominantly on clip-level labels, while partial label learning remains unexplored in audio analysis. This paper introduces partial label learning to sound event detection (SED), proposing a framework that leverages semantic priors from acoustic scene classification to automatically generate partial labels, i.e., a set of candidate sound events for each clip rather than exact event annotations. The authors formulate a multitask learning architecture that jointly optimizes acoustic scene classification and SED. They further design a self-distillation-based label refinement mechanism to train on both fully labeled and partially labeled data. Experiments demonstrate substantial reductions in annotation cost and consistent performance improvements across multiple benchmarks, validating the effectiveness and scalability of partial label learning in realistic audio scenarios.
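The core idea of scene-derived partial labels can be sketched as follows. This is a minimal illustration, not the paper's implementation; the scene-to-event mapping and all scene/event names are hypothetical placeholders:

```python
# Hypothetical scene-to-event prior: each acoustic scene constrains which
# sound events can plausibly occur in it (illustrative names only).
SCENE_EVENTS = {
    "home": {"dishes", "vacuum_cleaner", "speech", "cat"},
    "street": {"car", "speech", "footsteps", "siren"},
}

def partial_labels(scene, all_events):
    """Return a multi-hot candidate vector over sorted(all_events):
    1.0 marks events that MAY occur given the scene (the partial label),
    not events confirmed to be present."""
    candidates = SCENE_EVENTS.get(scene, set(all_events))
    return [1.0 if e in candidates else 0.0 for e in sorted(all_events)]

events = ["car", "cat", "dishes", "footsteps", "siren", "speech", "vacuum_cleaner"]
print(partial_labels("home", events))  # prints [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

Only the clip's scene label is needed to produce these candidate sets, which is what makes the annotation cheap: no per-event or per-frame labeling is required.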

Technology Category

Application Category

πŸ“ Abstract
Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach in which a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. Although they reduce annotation costs, weakly supervised and partial label learning often suffer from decreased detection performance because the exact event set and its temporal annotations are unavailable. To better balance annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
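One common way to turn such partial labels into a training signal, sketched below in NumPy, is to treat events outside the scene's candidate set as certain negatives and penalize only those classes, leaving candidate events unconstrained. This is an assumed, generic partial-label objective for illustration, not the paper's exact loss:

```python
import numpy as np

def partial_label_loss(frame_logits, candidate_mask):
    """Binary cross-entropy applied only to sure negatives (a sketch).
    frame_logits: (batch, frames, events) raw scores from the SED head.
    candidate_mask: (batch, events), 1.0 for events allowed by the scene."""
    probs = 1.0 / (1.0 + np.exp(-frame_logits))            # sigmoid
    neg = (1.0 - candidate_mask)[:, None, :]               # broadcast over frames
    neg = np.broadcast_to(neg, frame_logits.shape)
    bce = -np.log(1.0 - probs + 1e-7)                      # BCE with target = 0
    return (bce * neg).sum() / max(neg.sum(), 1.0)

logits = np.zeros((1, 2, 3))                 # 1 clip, 2 frames, 3 event classes
mask = np.array([[1.0, 0.0, 1.0]])           # class 1 not a candidate → sure negative
print(round(partial_label_loss(logits, mask), 4))  # prints 0.6931
```

In the semi-supervised setting the abstract describes, this term would be combined with a standard frame-level loss on the strongly labeled clips and a clip-level loss for acoustic scene classification.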
Problem

Research questions and friction points this paper is trying to address.

Reducing annotation costs for sound event detection using partial labels
Jointly analyzing acoustic scenes and sound events through multitask learning
Improving detection performance via semi-supervised training with partial labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses acoustic scenes to construct partial sound event labels
Proposes multitask learning with partial labels for joint analysis
Introduces self-distillation label refinement for partial labels
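The self-distillation refinement in the last point can be pictured with a small sketch: a teacher model's frame-level predictions, restricted to the scene's candidate events, promote confident candidates to soft positive targets while everything else stays a hard negative. The thresholding rule here is an illustrative assumption, not the paper's exact refinement procedure:

```python
import numpy as np

def refine_partial_labels(teacher_probs, candidate_mask, threshold=0.5):
    """Refine partial labels with a teacher's predictions (a sketch).
    teacher_probs: (frames, events) sigmoid outputs of a teacher model.
    candidate_mask: (events,), 1.0 for events allowed by the scene.
    Confident candidate predictions become soft targets; the rest become 0."""
    masked = teacher_probs * candidate_mask[None, :]   # zero out non-candidates
    return np.where(masked >= threshold, masked, 0.0)

probs = np.array([[0.9, 0.8, 0.2],
                  [0.1, 0.7, 0.6]])
mask = np.array([1.0, 0.0, 1.0])   # middle event excluded by the scene prior
print(refine_partial_labels(probs, mask))
```

The refined targets can then supervise the student (here, the same network in a self-distillation loop), progressively disambiguating which candidate events are actually present.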