Improving Audio Event Recognition with Consistency Regularization

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses scarce labeled data and limited generalization in audio event recognition. It proposes consistency regularization (CR) for this task: the model sees several augmented views of each input and is trained so that its predictions on those views agree. CR is evaluated in the fully supervised setting on both the small (about 20K) and large (about 1.8M) AudioSet training sets, where it improves consistently over supervised baselines that already rely heavily on data augmentation. On the small set, stronger augmentation and multiple augmentations per input bring additional gains. In the semi-supervised setting, with 20K labeled and 1.8M unlabeled samples, CR improves over the best model trained on the small labeled set alone.

📝 Abstract
Consistency regularization (CR), which enforces agreement between model predictions on augmented views, has found recent benefits in automatic speech recognition [1]. In this paper, we propose the use of consistency regularization for audio event recognition, and demonstrate its effectiveness on AudioSet. With extensive ablation studies for both small ($\sim$20k) and large ($\sim$1.8M) supervised training sets, we show that CR brings consistent improvement over supervised baselines which already heavily utilize data augmentation, and CR using stronger augmentation and multiple augmentations leads to additional gain for the small training set. Furthermore, we extend the use of CR into the semi-supervised setup with 20K labeled samples and 1.8M unlabeled samples, and obtain performance improvement over our best model trained on the small set.
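As an illustration of what "enforcing agreement between predictions on augmented views" can look like for multi-label audio tagging, here is a minimal sketch in plain Python. The abstract does not state the paper's exact loss, so everything here is an assumption: the function names (`bce`, `consistency`, `cr_loss`), the mean-squared-difference agreement term, and the `weight` parameter are hypothetical choices made for illustration, not the authors' formulation.

```python
import math


def sigmoid(x):
    """Map a logit to a per-class probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(logits, targets):
    """Mean binary cross-entropy over classes (multi-label tagging)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)


def consistency(logits_a, logits_b):
    """Assumed agreement term: mean squared difference between the
    class probabilities predicted for two augmented views."""
    diffs = [(sigmoid(a) - sigmoid(b)) ** 2 for a, b in zip(logits_a, logits_b)]
    return sum(diffs) / len(diffs)


def cr_loss(logits_a, logits_b, targets, weight=1.0):
    """Supervised loss averaged over both views, plus a weighted
    consistency penalty (weight is a hypothetical hyperparameter)."""
    supervised = 0.5 * (bce(logits_a, targets) + bce(logits_b, targets))
    return supervised + weight * consistency(logits_a, logits_b)


# Logits for two augmented views of one clip, three event classes.
view_a = [2.0, -1.0, 0.5]
view_b = [1.5, -0.5, -0.2]
labels = [1, 0, 1]
print(cr_loss(view_a, view_b, labels, weight=1.0))
```

When both views yield identical logits the consistency term vanishes and the loss reduces to the ordinary supervised loss, so the extra term only penalizes disagreement between views.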
Problem

Research questions and friction points this paper is trying to address.

Applying consistency regularization to audio event recognition
Evaluating CR effectiveness on AudioSet with supervised training
Extending CR to semi-supervised learning with unlabeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency regularization for audio event recognition
Stronger and multiple augmentations for small datasets
Semi-supervised setup with labeled and unlabeled samples
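The multiple-augmentation and semi-supervised points above can be sketched together, under assumptions. In this hypothetical formulation, each clip is augmented into two or more views; a consistency term is averaged over all pairs of views, and a supervised term is added only when the clip has labels. The names `pairwise_consistency` and `example_loss`, the squared-difference agreement measure, and `cr_weight` are illustrative choices, not details taken from the paper.

```python
import math
from itertools import combinations


def probs(logits):
    """Per-class sigmoid probabilities for one view."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def pairwise_consistency(views):
    """Average squared disagreement over all pairs of views
    (requires at least two views)."""
    pairs = list(combinations(views, 2))
    total = 0.0
    for a, b in pairs:
        total += sum((pa - pb) ** 2 for pa, pb in zip(a, b)) / len(a)
    return total / len(pairs)


def example_loss(view_logits, targets=None, cr_weight=1.0):
    """Loss for one clip. Unlabeled clips (targets=None) contribute
    only the consistency term; labeled clips add cross-entropy."""
    views = [probs(logits) for logits in view_logits]
    loss = cr_weight * pairwise_consistency(views)
    if targets is not None:
        eps = 1e-12  # guards against log(0)
        supervised = 0.0
        for view in views:
            supervised += sum(
                -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
                for p, y in zip(view, targets)
            ) / len(targets)
        loss += supervised / len(views)
    return loss


# Three augmented views of one clip, two event classes.
views = [[2.0, -2.0], [0.0, 0.0], [-2.0, 2.0]]
print(example_loss(views))                  # unlabeled clip
print(example_loss(views, targets=[1, 0]))  # labeled clip
```

Because the consistency term needs no labels, the same function can be applied to both the labeled and the unlabeled portion of a batch, which is what makes this kind of regularizer usable in a semi-supervised setup.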
Shanmuka Sadhu
Dept. of Computer Science, Rutgers University, New Brunswick, NJ, USA
Weiran Wang
University of Iowa
Machine learning, speech processing