🤖 AI Summary
This work addresses the challenges of scarce labeled data and limited model generalization in audio event recognition. To this end, it introduces consistency regularization to this task for the first time, proposing a unified framework that integrates multi-augmentation strategies with semi-supervised learning. Methodologically, the approach generates diverse input views via strong data augmentation and jointly optimizes prediction consistency across both fully supervised and few-shot semi-supervised settings. A novel multi-augmentation composition mechanism is designed to enhance view diversity and robustness. Experiments on the AudioSet dataset—under a realistic scale disparity (20K labeled vs. 1.8M unlabeled samples)—demonstrate that the method significantly outperforms strong baselines. In the few-shot semi-supervised setting, it achieves substantial performance gains over the best model trained solely on limited labeled data. These results validate the effectiveness and scalability of consistency regularization for audio event recognition.
📝 Abstract
Consistency regularization (CR), which enforces agreement between model predictions on augmented views, has found recent benefits in automatic speech recognition [1]. In this paper, we propose the use of consistency regularization for audio event recognition, and demonstrate its effectiveness on AudioSet. With extensive ablation studies for both small ($sim$20k) and large ($sim$1.8M) supervised training sets, we show that CR brings consistent improvement over supervised baselines which already heavily utilize data augmentation, and CR using stronger augmentation and multiple augmentations leads to additional gain for the small training set. Furthermore, we extend the use of CR into the semi-supervised setup with 20K labeled samples and 1.8M unlabeled samples, and obtain performance improvement over our best model trained on the small set.