🤖 AI Summary
This work addresses zero-shot classification of unseen species in animal sound recognition by introducing a large-scale dataset of recordings from 6,823 species (4,225 hours in total), annotated with 22 ecological attributes. The authors propose a language–audio pretraining model that integrates the biological taxonomic hierarchy into its cross-modal alignment. By embedding the species taxonomy directly into the language–audio contrastive learning framework, the model learns aligned multimodal representations capable of both zero-shot species identification and inference of ecological traits. Experimental results show that the proposed approach significantly outperforms existing baselines, including CLAP, on unseen-species recognition and ecological attribute prediction.
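
As a rough illustration of how a taxonomic hierarchy can be folded into language–audio contrastive pretraining, the sketch below builds a text prompt from a species' taxonomic ranks and applies a symmetric InfoNCE objective between batched audio and text embeddings. The function names, prompt template, and the random tensors standing in for encoder outputs are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of taxonomy-aware language-audio contrastive training.
# build_taxonomy_prompt and clip_style_loss are hypothetical helpers,
# not the paper's actual API; encoders are stubbed with random tensors.
import torch
import torch.nn.functional as F


def build_taxonomy_prompt(record: dict) -> str:
    """Flatten a species' taxonomic ranks into a single text prompt (assumed format)."""
    return (
        f"a recording of {record['species']}, genus {record['genus']}, "
        f"family {record['family']}, order {record['order']}, class {record['class']}"
    )


def clip_style_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: a taxonomy prompt for one species, plus random embeddings
# standing in for the audio- and text-encoder outputs of a batch.
record = {"species": "Turdus merula", "genus": "Turdus", "family": "Turdidae",
          "order": "Passeriformes", "class": "Aves"}
print(build_taxonomy_prompt(record))

batch, dim = 8, 512
audio_emb = torch.randn(batch, dim)   # placeholder audio-encoder output
text_emb = torch.randn(batch, dim)    # placeholder text-encoder output of taxonomy prompts
print(clip_style_loss(audio_emb, text_emb).item())
```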
📝 Abstract
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structure, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
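
For readers unfamiliar with how zero-shot recognition works in CLAP-style models, the sketch below scores one audio embedding against text embeddings of candidate-species prompts by cosine similarity and takes the argmax. The species names, prompt template, and random placeholder embeddings are hypothetical and only stand in for real encoder outputs.

```python
# Minimal zero-shot classification sketch: rank candidate-species prompts
# against an audio clip by cosine similarity (all inputs are placeholders).
import torch
import torch.nn.functional as F

candidate_species = ["Turdus merula", "Strix aluco", "Hyla arborea"]
prompts = [f"a vocalization of {name}" for name in candidate_species]

dim = 512
audio_emb = torch.randn(1, dim)                 # placeholder audio embedding
text_embs = torch.randn(len(prompts), dim)      # placeholder prompt embeddings

scores = F.normalize(audio_emb, dim=-1) @ F.normalize(text_embs, dim=-1).t()
predicted = candidate_species[scores.argmax(dim=-1).item()]
print(predicted)
```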