Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning

📅 2024-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Learning end-to-end embeddings for entire bird song bouts remains challenging because fine-grained annotations are unavailable and the embedding must capture both species-level generalized features and individual-level discriminative acoustic features. Method: This paper proposes a self-supervised disentangled representation learning framework with two encoders, one modeling the shared (species-level) acoustic factors and one modeling the private (individual-level) factors, without relying on note- or syllable-level segmentation. Contribution/Results: The work extends disentangled representation learning to the whole-song level, which allows the informative part of the embedding to be analyzed and its dimensionality to be reduced further. Evaluated on a great tit dataset, the method outperforms the compared pretrained models and a vanilla VAE in clustering performance. The paper also analyzes the compressed embedding and explains how well the representation of bird vocalizations is disentangled.

📝 Abstract
This paper addresses the extraction of bird vocalization embeddings at the whole-song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as the Variational Autoencoder (VAE) have proven effective at extracting such low-dimensional embeddings from vocalization segments at the note or syllable level. To extend processing to the entire song rather than cutting it into segments, this paper treats each vocalization as consisting of a generalized part and a discriminative part, and uses two encoders to learn them. The proposed method is evaluated on the Great Tits dataset by clustering performance, where it outperforms the compared pre-trained models and a vanilla VAE. Finally, the paper analyzes the informative part of the embedding, further compresses its dimensionality, and explains the disentanglement performance on bird vocalizations.
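The two-encoder idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the layer sizes, the single linear layer per encoder, the names `GaussianEncoder`, `forward`, `N_SHARED`, and `N_PRIVATE`, and the plain VAE loss are all assumptions made for illustration. In particular, the objective that pushes the two latents to hold different information is not given in the text above, so the sketch shows only the structure: two Gaussian encoders over one whole-song feature vector, with their latents concatenated into one embedding.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one linear layer (hypothetical sizes)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class GaussianEncoder:
    """Maps an input vector to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, n_in, n_latent, rng):
        self.mu_layer = init_layer(n_in, n_latent, rng)
        self.logvar_layer = init_layer(n_in, n_latent, rng)

    def __call__(self, x):
        return linear(x, *self.mu_layer), linear(x, *self.logvar_layer)


def sample(mu, logvar, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def forward(x, shared_enc, private_enc, decoder, rng):
    """Encode x with both encoders, decode from the concatenated latent sample."""
    mu_s, lv_s = shared_enc(x)    # assumed "generalized" part
    mu_p, lv_p = private_enc(x)   # assumed "discriminative" part
    z = sample(mu_s, lv_s, rng) + sample(mu_p, lv_p, rng)  # list concatenation
    x_hat = linear(z, *decoder)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    kl = kl_to_standard_normal(mu_s, lv_s) + kl_to_standard_normal(mu_p, lv_p)
    # The concatenated means serve as the embedding of the whole song.
    return {"embedding": mu_s + mu_p, "reconstruction": x_hat, "loss": recon + kl}


rng = random.Random(0)
N_FEATURES, N_SHARED, N_PRIVATE = 16, 4, 2  # illustrative sizes only
shared_enc = GaussianEncoder(N_FEATURES, N_SHARED, rng)
private_enc = GaussianEncoder(N_FEATURES, N_PRIVATE, rng)
decoder = init_layer(N_SHARED + N_PRIVATE, N_FEATURES, rng)

song = [rng.random() for _ in range(N_FEATURES)]  # stand-in for song features
out = forward(song, shared_enc, private_enc, decoder, rng)
print(len(out["embedding"]), len(out["reconstruction"]))
```

In a real system each encoder would be a deeper network over a spectrogram and the weights would be trained; here they stay at their random initial values, so only the shapes and the loss terms are meaningful.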
Problem

Research questions and friction points this paper is trying to address.

Automatic Recognition
Birdsong Analysis
Feature Extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Representation Learning
Avian Vocalization Analysis
Feature Extraction