🤖 AI Summary
Few-shot audio classification remains underexplored, constrained by the scarcity of labeled data. Method: We propose a framework that integrates an angle-aware supervised contrastive loss with prototypical networks, improving intra-class compactness and inter-class separability. To make the representations more robust, we apply SpecAugment-based spectral augmentation and a self-attention mechanism that fuses the augmented views into a single embedding. The model is trained end-to-end to encourage semantic consistency and generalization. Contribution/Results: The approach achieves state-of-the-art performance on the MetaAudio benchmark under the 5-way 5-shot setting, outperforming existing few-shot audio classification methods by combining supervised contrastive learning, prototype-based inference, and robust feature augmentation.
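The prototype-based inference mentioned above can be sketched as follows. This is a minimal illustration of the standard prototypical-network recipe (class prototypes as mean support embeddings, nearest-prototype assignment under Euclidean distance); the function names are ours and the sketch is not the paper's exact implementation.

```python
import numpy as np

def class_prototypes(support, labels):
    """Mean embedding per class. support: (N, D) embeddings, labels: (N,)."""
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(queries, support, labels):
    """Assign each query embedding to the class of its nearest prototype."""
    classes, protos = class_prototypes(support, labels)
    # squared Euclidean distance from each query to each prototype: (Q, C)
    dists = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]
```

In a 5-way 5-shot episode, `support` would hold 25 embeddings (5 per class) produced by the trained encoder, and each query is labeled by its closest class mean.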
📝 Abstract
Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning for audio classification remains relatively underexplored. In this work, we investigate the effect of integrating a supervised contrastive loss into prototypical few-shot training for audio classification. In particular, we demonstrate that an angular loss further improves performance compared to the standard contrastive loss. Our method applies SpecAugment and then a self-attention mechanism to encapsulate the diverse information of the augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark comprising five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in the 5-way, 5-shot setting.