Robust Audio Tagging under Class-wise Supervision Unreliability

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses class-dependent unreliable supervision in audio annotation data, which biases model training. It targets three sources of unreliability other than missing labels: spurious label additions, misassignments between similar classes, and weakened label evidence. The authors propose the Class-wise Supervision Unreliability (CSU) framework, which learns a separate unreliability parameter for each class and down-weights less reliable supervision during training, without modifying the model architecture or inference procedure. The paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability.

📝 Abstract

Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue becomes harder as real and generated audio are increasingly mixed in training, and generated samples do not always match their intended semantic labels. Prior work mainly addressed unreliable supervision from missing-positive labels, while this paper targets three other sources of unreliable supervision: spurious additions, misassignments between similar classes, and weakened label evidence. These effects introduce class-dependent optimisation bias that is not explicitly modeled by most existing methods. To bridge this gap, the paper proposes a Class-wise Supervision Unreliability (CSU) framework that controls supervision strength at the class level during training. CSU learns a separate unreliability parameter for each class and down-weights less reliable supervision without changing the model architecture or inference process. To support evaluations, this paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability. The results indicate that explicit class-wise modeling of supervision unreliability is an effective and practical strategy for robust audio tagging under large-scale weakly labeled training. Code and data are available at: https://github.com/Yuanbo2020/CSU
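The abstract does not give the CSU loss itself, so the following is only an illustrative sketch of the general idea of a learned per-class unreliability parameter that down-weights supervision. It assumes a familiar uncertainty-weighting form, exp(-s_c) * BCE_c + s_c, with s_c clamped at zero so that weights never exceed 1. The function names, the clamping, and the fixed toy predictions are all assumptions made for illustration, not the authors' formulation; see the linked repository for the actual method.

```python
import math

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for a single class probability and target."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def classwise_weighted_loss(probs, targets, s):
    """Mean over classes of exp(-s_c) * BCE_c + s_c.

    s[c] is a hypothetical per-class unreliability parameter: a larger
    value shrinks that class's supervision weight exp(-s_c), while the
    additive s_c term keeps the parameter from growing without bound.
    """
    terms = [math.exp(-sc) * bce(p, y) + sc
             for p, y, sc in zip(probs, targets, s)]
    return sum(terms) / len(terms)

def unreliability_grad(probs, targets, s):
    """Analytic gradient of classwise_weighted_loss with respect to each s_c."""
    n = len(s)
    return [(1.0 - math.exp(-sc) * bce(p, y)) / n
            for p, y, sc in zip(probs, targets, s)]

def fit_unreliability(probs, targets, steps=200, lr=0.5):
    """Gradient descent on s only, with predictions held fixed for illustration.

    In real training the model outputs would change as well; they are
    frozen here so the effect on s is easy to see.
    """
    s = [0.0] * len(probs)
    for _ in range(steps):
        grad = unreliability_grad(probs, targets, s)
        # Projection onto s_c >= 0 (an assumption): weights stay in (0, 1].
        s = [max(0.0, sc - lr * g) for sc, g in zip(s, grad)]
    return s

# Toy case: class 0 agrees with its label, class 1 persistently disagrees.
probs, targets = [0.9, 0.2], [1.0, 1.0]
s = fit_unreliability(probs, targets)
weights = [math.exp(-sc) for sc in s]
print("unreliability:", s)
print("supervision weights:", weights)
print("weighted loss:", classwise_weighted_loss(probs, targets, s))
```

In this toy setting the class whose label keeps disagreeing with the prediction ends up with a larger unreliability value and therefore a smaller supervision weight, while the well-fit class keeps full weight. Nothing about the model or its inference path changes, which matches the property the abstract highlights.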

Problem

Research questions and friction points this paper is trying to address.

audio tagging

weakly labeled data

supervision unreliability

class-wise bias

label noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Class-wise Supervision Unreliability

Audio Tagging

Weakly Supervised Learning