🤖 AI Summary
It remains unclear whether self-supervised models pre-trained on human speech transfer effectively to bioacoustic tasks, or whether domain-specific pretraining on animal vocalizations is necessary. Method: We systematically evaluate speech models (Wav2Vec 2.0, HuBERT) against animal-call-specific models (e.g., AVES) across diverse bioacoustic datasets (FreeSound, BirdVox) on species identification and sound event detection, with and without ASR fine-tuning. Contribution/Results: Pretraining on animal vocalizations yields only marginal improvements over speech pretraining, with comparable performance in most scenarios; ASR fine-tuning gives mixed results, suggesting that the general-purpose representations learned during self-supervised pretraining are already well suited to bioacoustics. These findings indicate that domain-specific pretraining and extensive fine-tuning may not be necessary for strong bioacoustic performance, which makes speech-pretrained models a practical starting point for animal acoustic modeling.
📝 Abstract
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.
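To make the "feature extractor" framing concrete, the following is a minimal sketch of one common way such comparisons are set up: frame-level embeddings from a frozen encoder are mean-pooled into one vector per recording, and a lightweight classifier is fit on top. Everything here is an assumption for illustration and not the paper's protocol: `fake_ssl_frames` is a hypothetical stand-in that generates synthetic embeddings instead of calling a real SSL model, and the nearest-centroid classifier, the pooling choice, and all sizes are placeholders.

```python
import random
from statistics import fmean


def mean_pool(frames):
    """Average a list of equal-length vectors into a single vector."""
    dim = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dim)]


def fit_centroids(vectors, labels):
    """Compute one centroid per class from pooled recording-level vectors."""
    by_class = {}
    for vec, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vec)
    return {label: mean_pool(vecs) for label, vecs in by_class.items()}


def predict(centroids, vector):
    """Return the label of the centroid closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vector))


def fake_ssl_frames(class_id, rng, num_frames=20, dim=8):
    """HYPOTHETICAL stand-in for a frozen SSL encoder.

    Returns synthetic frame embeddings clustered around a class-dependent
    mean. A real setup would run audio through a pre-trained model instead.
    """
    return [[rng.gauss(3.0 * class_id, 1.0) for _ in range(dim)]
            for _ in range(num_frames)]


def evaluate(seed=0, per_class=10, classes=(0, 1, 2)):
    """Fit on synthetic 'train' recordings and report accuracy on 'test' ones."""
    rng = random.Random(seed)
    train = [(mean_pool(fake_ssl_frames(c, rng)), c)
             for c in classes for _ in range(per_class)]
    test = [(mean_pool(fake_ssl_frames(c, rng)), c)
            for c in classes for _ in range(per_class)]
    centroids = fit_centroids([v for v, _ in train], [y for _, y in train])
    correct = sum(predict(centroids, v) == y for v, y in test)
    return correct / len(test)


if __name__ == "__main__":
    print(f"accuracy on synthetic embeddings: {evaluate():.2f}")
```

Under this kind of setup, comparing a speech-pretrained and an animal-vocalization-pretrained model would amount to swapping the encoder behind `fake_ssl_frames` while keeping the pooling and classifier fixed, so that any accuracy difference is attributable to the representations.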