Self-supervised Learning for Acoustic Few-Shot Classification

📅 2024-09-15
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited few-shot classification performance in low-resource bioacoustic scenarios caused by severe label scarcity, this paper proposes a framework that integrates task-specific self-supervised pretraining with few-shot fine-tuning. Methodologically, it introduces a hybrid architecture combining state-space models (S4/Mamba) with convolutional neural networks (CNNs), jointly capturing long-range temporal dynamics and local acoustic patterns. Crucially, contrastive learning is employed for task-aware self-supervised pretraining, eliminating reliance on domain-mismatched external pretrained models. Evaluated on standard benchmarks and real-world bioacoustic datasets under extreme 1–5-shot settings, the method achieves significant accuracy improvements over state-of-the-art approaches, establishing an efficient, transferable paradigm for acoustic few-shot learning with minimal labelled data.
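The hybrid encoder described above can be pictured as a CNN front-end that captures local spectro-temporal patterns, feeding a state-space sequence layer that models long-range context. The sketch below is an illustrative PyTorch reading of that idea, not the authors' code: the `DiagonalSSM` layer is a deliberately simplified linear SSM standing in for S4/Mamba, and all layer sizes (`n_mels=64`, `dim=128`) are assumptions.

```python
# Minimal sketch (not the paper's implementation): CNN preprocessing over
# log-mel spectrograms followed by a simplified diagonal linear SSM layer
# that stands in for S4/Mamba. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Toy linear state-space layer: x_t = A*x_{t-1} + B*u_t, y_t = C*x_t.

    A is diagonal and squashed into (-1, 1) for stability. Real S4/Mamba
    layers use structured parameterisations and fast scans instead of the
    explicit Python loop used here.
    """

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a = nn.Parameter(torch.rand(dim, state))          # diagonal transition
        self.b = nn.Parameter(torch.randn(dim, state) * 0.1)   # input projection
        self.c = nn.Parameter(torch.randn(dim, state) * 0.1)   # output projection

    def forward(self, u: torch.Tensor) -> torch.Tensor:        # u: (batch, time, dim)
        a = torch.tanh(self.a)                                  # keep |A| < 1
        x = torch.zeros(u.size(0), u.size(2), self.a.size(1), device=u.device)
        ys = []
        for t in range(u.size(1)):
            x = a * x + self.b * u[:, t].unsqueeze(-1)          # (batch, dim, state)
            ys.append((x * self.c).sum(-1))                     # (batch, dim)
        return torch.stack(ys, dim=1)                           # (batch, time, dim)


class CNNSSMEncoder(nn.Module):
    """CNN for local acoustic patterns, SSM for long-range temporal context."""

    def __init__(self, n_mels: int = 64, dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                               # input: (batch, 1, mel, time)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), dim)
        self.ssm = DiagonalSSM(dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:      # spec: (batch, mel, time)
        h = self.cnn(spec.unsqueeze(1))                          # (batch, 64, mel/4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)                     # (batch, time, 64*mel/4)
        h = self.proj(h)                                         # (batch, time, dim)
        return self.ssm(h).mean(dim=1)                           # mean-pool over time -> embedding


if __name__ == "__main__":
    enc = CNNSSMEncoder()
    print(enc(torch.randn(4, 64, 200)).shape)                    # torch.Size([4, 128])
```

In a faithful reproduction, the `DiagonalSSM` block would be replaced by an actual S4 or Mamba layer operating on the same (batch, time, dim) sequence.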

📝 Abstract
Labelled data are limited and self-supervised learning is one of the most important approaches for reducing labelling requirements. While it has been extensively explored in the image domain, it has so far not received the same amount of attention in the acoustic domain. Yet, reducing labelling is a key requirement for many acoustic applications. Specifically in bioacoustics, sufficient labels for fully supervised learning are rarely available. This has led to the widespread use, for bioacoustic tasks, of acoustic recognisers that have been pre-trained on unrelated data. We posit that training on the actual task data and combining self-supervised pre-training with few-shot classification is a superior approach that has the ability to deliver high accuracy even when only a few labels are available. To this end, we introduce and evaluate a new architecture that combines CNN-based preprocessing with feature extraction based on state space models (SSMs). This combination is motivated by the fact that CNN-based networks alone struggle to capture temporal information effectively, which is crucial for classifying acoustic signals. SSMs, specifically S4 and Mamba, on the other hand, have been shown to have an excellent ability to capture long-range dependencies in sequence data. We pre-train this architecture using contrastive learning on the actual task data and subsequently fine-tune it with an extremely small amount of labelled data. We evaluate the performance of this proposed architecture for ($n$-shot, $n$-class) classification on standard benchmarks as well as real-world data. Our evaluation shows that it outperforms state-of-the-art architectures on the few-shot classification problem.
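The abstract describes contrastive pre-training on the task data itself. A common way to realise this is SimCLR-style training with an NT-Xent loss on two augmented views of each unlabelled clip; the sketch below assumes that recipe (the augmentation, temperature, and `pretrain_step` helper are illustrative, not taken from the paper) and reuses the `CNNSSMEncoder` from the earlier sketch.

```python
# Minimal sketch (assumed recipe, not the paper's exact one): SimCLR-style
# NT-Xent contrastive pre-training on two augmented views of each
# unlabelled spectrogram, using an encoder like CNNSSMEncoder above.
import torch
import torch.nn.functional as F


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss for paired embeddings z1, z2 of shape (n, d)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2n, d)
    sim = z @ z.t() / tau                                        # cosine similarities
    sim.fill_diagonal_(float("-inf"))                            # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets.to(z.device))


def augment(spec: torch.Tensor) -> torch.Tensor:
    """Illustrative augmentation: random gain plus a SpecAugment-like time mask.

    Assumes spec has shape (batch, mel, time) with more than 20 time frames.
    """
    out = spec * (0.8 + 0.4 * torch.rand(spec.size(0), 1, 1, device=spec.device))
    t0 = torch.randint(0, spec.size(2) - 20, (1,)).item()
    out[:, :, t0:t0 + 20] = 0.0                                  # mask a 20-frame span
    return out


def pretrain_step(encoder, specs, optimizer):
    """One self-supervised step on a batch of unlabelled spectrograms."""
    z1, z2 = encoder(augment(specs)), encoder(augment(specs))
    loss = nt_xent(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design point the paper emphasises is that this pre-training sees the same recordings the classifier is later evaluated on, so the learned embedding is not domain-mismatched the way an externally pre-trained recogniser can be.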
Problem

Research questions and friction points this paper is trying to address.

Reducing labeling needs in acoustic classification via self-supervised learning
Improving few-shot bioacoustic classification with task-specific pretraining
Enhancing temporal feature capture in acoustic signals using SSMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines CNN and SSM for acoustic feature extraction
Uses contrastive learning for self-supervised pre-training
Fine-tunes with minimal labeled data for few-shot classification (see the sketch after this list)
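To make the ($n$-shot, $n$-class) protocol concrete, the sketch below builds an episode from a small labelled pool and fine-tunes the pre-trained encoder together with a fresh linear head on the support clips, then scores the held-out queries. The episode sizes, optimiser settings, and the `sample_episode`/`finetune_episode` helpers are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch (assumed setup): (n-shot, n-class) episode construction
# and fine-tuning of a pre-trained encoder on the tiny support set.
import random
import torch
import torch.nn as nn


def sample_episode(pool, n_classes=5, n_shot=5, n_query=5):
    """pool: dict mapping label -> list of equally sized spectrograms (mel, time)."""
    classes = random.sample(list(pool), n_classes)
    support, query, ys, yq = [], [], [], []
    for new_label, cls in enumerate(classes):
        clips = random.sample(pool[cls], n_shot + n_query)
        support += clips[:n_shot]
        query += clips[n_shot:]
        ys += [new_label] * n_shot
        yq += [new_label] * n_query
    return (torch.stack(support), torch.tensor(ys),
            torch.stack(query), torch.tensor(yq))


def finetune_episode(encoder, episode, dim=128, n_classes=5, steps=50, lr=1e-3):
    xs, ys, xq, yq = episode
    head = nn.Linear(dim, n_classes)                       # fresh classifier head
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):                                 # fit on the support set only
        loss = nn.functional.cross_entropy(head(encoder(xs)), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                                  # evaluate on held-out queries
        acc = (head(encoder(xq)).argmax(1) == yq).float().mean().item()
    return acc
```

Averaging this accuracy over many sampled episodes gives the few-shot performance that the paper compares against state-of-the-art baselines.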
Jingyong Liang
Information Technology, Monash University, Melbourne, Australia
Bernd Meyer
Information Technology, Monash University, Melbourne, Australia
Issac Ning Lee
Engineering, Monash University, Melbourne, Australia
Thanh-Toan Do
Senior Lecturer, Monash University
Computer Vision, Robotic Vision, Machine Learning