🤖 AI Summary
Marmoset vocalization data suffer from high noise levels, sparse annotations, and poor structural organization, hindering joint modeling of vocal segmentation, classification, and caller identification. Method: This work introduces the Transformer architecture to marmoset acoustic analysis for the first time, proposing an end-to-end multi-task learning model that takes spectrograms as input. Leveraging self-attention, the model explicitly captures long-range temporal dependencies and cross-vocal-unit relationships, overcoming CNNs' limitations in global structural modeling. Results: Evaluated on real-world low-resource marmoset recordings, the model jointly optimizes three tasks: vocal segmentation (F1 score), call-type classification (accuracy), and caller identification (caller-ID accuracy), with all three metrics significantly outperforming CNN baselines. This study establishes a scalable, Transformer-based acoustic analysis paradigm for investigating social communication and language development mechanisms in non-human primates.
Abstract
The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant language development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work using a CNN achieved joint call segmentation, classification, and caller identification for marmoset vocalizations. However, CNNs are limited in modeling long-range acoustic patterns. The Transformer architecture, which has been shown to outperform CNNs, uses self-attention to process information in parallel over long distances and to capture the global structure of marmoset vocalizations. We propose using the Transformer to jointly segment and classify marmoset calls and identify the caller of each vocalization.
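To make the architecture concrete, the following is a minimal NumPy sketch, not the paper's implementation, of the core idea: a single self-attention layer over spectrogram frames feeding three task heads (frame-level segmentation, utterance-level call-type classification, and caller identification). All dimensions, class counts, and weight initializations here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, F = 50, 64          # time frames, mel bins (hypothetical)
D = 32                 # model dimension (hypothetical)
N_CALL_TYPES = 10      # placeholder number of call types (e.g. phee, twitter, trill, ...)
N_CALLERS = 8          # placeholder number of individual marmosets

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Input: a log-mel spectrogram, linearly projected to the model dimension.
spec = rng.standard_normal((T, F))
W_in = rng.standard_normal((F, D)) / np.sqrt(F)
x = spec @ W_in                                   # (T, D)

# Single-head self-attention: every frame attends to every other frame.
# This global (T x T) interaction is what lets the model relate distant
# vocal units, in contrast to the local receptive fields of CNN kernels.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(D))              # (T, T) attention weights
h = attn @ v                                      # (T, D) contextual frames

# Three task heads sharing the same encoder output:
W_seg = rng.standard_normal((D, 2)) / np.sqrt(D)            # call / no-call per frame
W_cls = rng.standard_normal((D, N_CALL_TYPES)) / np.sqrt(D) # call type
W_id  = rng.standard_normal((D, N_CALLERS)) / np.sqrt(D)    # caller identity

seg_probs = softmax(h @ W_seg)        # (T, 2): frame-level segmentation
pooled = h.mean(axis=0)               # mean-pool frames for utterance-level tasks
call_probs = softmax(pooled @ W_cls)  # (N_CALL_TYPES,)
caller_probs = softmax(pooled @ W_id) # (N_CALLERS,)

print(seg_probs.shape, call_probs.shape, caller_probs.shape)
```

In a trained model, the three heads would be optimized jointly with a combined loss, so the shared self-attention encoder learns representations useful for all three tasks at once.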