🤖 AI Summary
This study investigates the cross-domain transferability of speech self-supervised models (HuBERT, WavLM, XEUS) to bioacoustic tasks, addressing the challenge of scarce labeled data in animal sound detection and classification. We propose a time-aware representation analysis and noise-robustness evaluation framework that combines linear probing on time-averaged representations, temporal downstream modeling, and frequency-band sensitivity analysis to assess how different pre-training strategies affect cross-species sound recognition. Experiments across multiple animal audio datasets show that downstream models trained on these representations achieve results competitive with fine-tuned bioacoustic pre-trained models; notably, noise-robust pre-training improves generalization under low signal-to-noise ratio conditions. To our knowledge, this is the first systematic validation of speech self-supervised representations for non-speech bioacoustic applications. These findings highlight the potential of speech-derived representations as effective, low-cost, general-purpose features for bioacoustics.
📝 Abstract
Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models' properties with linear probing on time-averaged representations. We then extend the approach with other downstream architectures that account for time-wise information. Finally, we study the effect of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.
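As a rough illustration of what "linear probing on time-averaged representations" can look like, the sketch below mean-pools frame-level features over time and fits a single linear classifier on the pooled vectors while the encoder stays frozen. It is a minimal sketch under stated assumptions, not the authors' setup: the frame features are synthetic stand-ins for frozen encoder outputs, the probe is a plain binary logistic regression, and all function names (`mean_pool`, `train_linear_probe`, `fake_clip`) are hypothetical.

```python
import math
import random

def mean_pool(frames):
    """Average a (T x D) list of frame vectors over time into one D-dim vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit binary logistic regression on fixed features with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - target  # sigmoid(z) - label
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def fake_clip(label, rng, n_frames=20, dim=8):
    """ASSUMPTION: synthetic stand-in for frozen encoder frames of one audio clip."""
    mean = 1.0 if label == 1 else -1.0
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n_frames)]

rng = random.Random(0)
labels = [i % 2 for i in range(60)]
# Only the pooled vectors reach the probe; the "encoder" output is never updated.
features = [mean_pool(fake_clip(t, rng)) for t in labels]
w, b = train_linear_probe(features[:40], labels[:40])
accuracy = sum(predict(w, b, x) == t for x, t in zip(features[40:], labels[40:])) / 20
print(f"held-out accuracy: {accuracy:.2f}")
```

Because pooling discards the order of frames, a probe like this measures only what survives time averaging, which is why downstream architectures that see the full frame sequence are a natural next step.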