GmSLM : Generative Marmoset Spoken Language Modeling

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the neural mechanisms underlying vocal communication in marmosets, nonhuman primates exhibiting language-like, socially contingent features (e.g., vocal labeling, turn-taking) not observed in most nonhuman species. Method: We propose the Generative Marmoset Spoken Language Model (GmSLM), a generative modeling framework integrating unsupervised field audio recordings with weakly annotated dialogues to build an end-to-end speech-language modeling pipeline for joint acoustic representation learning and vocal sequence generation. We further introduce a novel zero-shot evaluation metric that discriminates real from synthetic dialogues without manual annotations. Results: GmSLM produces acoustically realistic calls, and downstream classification of real versus synthetic dialogues achieves significantly higher accuracy than baseline models. To our knowledge, this is the first application of generative modeling to vocal communication in nonhuman primates, establishing a scalable computational bridge between vocal behavior and neural activity analysis and addressing key limitations of applying large language models to animal vocalizations.
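The zero-shot metric described above scores dialogues by how likely they are under a model trained only on real vocal sequences, with no manual labels. The sketch below illustrates that idea under simplifying assumptions: GmSLM itself uses learned acoustic units and a neural language model, whereas here a toy add-alpha-smoothed bigram model over hypothetical discrete "call units" stands in, and the "real" and "synthetic" sequences are synthetic toy data, not marmoset recordings.

```python
import math
import random
from collections import Counter

def train_bigram(sequences, vocab_size, alpha=1.0):
    """Fit a bigram model with add-alpha smoothing over discrete unit sequences."""
    counts = Counter()   # bigram counts
    context = Counter()  # unigram (context) counts
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1

    def logprob(a, b):
        return math.log((counts[(a, b)] + alpha) /
                        (context[a] + alpha * vocab_size))
    return logprob

def sequence_nll(seq, logprob):
    """Average negative log-likelihood of a unit sequence under the model."""
    return -sum(logprob(a, b) for a, b in zip(seq, seq[1:])) / max(1, len(seq) - 1)

random.seed(0)
VOCAB = 8
# Hypothetical "real" dialogues: sequentially structured unit streams.
real = [[i % VOCAB for i in range(start, start + 20)] for start in range(50)]
# Hypothetical "synthetic" dialogues: the same units, but with no sequential structure.
fake = [[random.randrange(VOCAB) for _ in range(20)] for _ in range(50)]

lp = train_bigram(real, VOCAB)
real_scores = [sequence_nll(s, lp) for s in real]
fake_scores = [sequence_nll(s, lp) for s in fake]

# Real dialogues should receive lower average NLL than unstructured ones,
# so a simple threshold on NLL separates the two sets without annotations.
print(sum(real_scores) / len(real_scores) < sum(fake_scores) / len(fake_scores))
```

The design choice this mirrors is that discrimination falls out of likelihood scoring alone: no classifier is trained on synthetic examples, which is what makes the evaluation zero-shot.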

📝 Abstract
Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primate vocal communication is entirely innate, and show features similar to human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM-generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations and may support further investigation of the neural basis of vocal communication, providing a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided at: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
Problem

Research questions and friction points this paper is trying to address.

Modeling marmoset vocal communication with generative AI
Developing unsupervised evaluation metrics for animal vocalizations
Linking vocalization patterns with neural brain activity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Marmoset Spoken Language Modeling
Zero-shot evaluation with unsupervised data
Acoustically matching vocalization generation
Talia Sternberg
The School of Computer Science and Engineering, Hebrew University of Jerusalem
Michael London
The Edmond and Lily Safra Center for Brain Sciences (ELSC), Hebrew University of Jerusalem
David Omer
The Edmond and Lily Safra Center for Brain Sciences (ELSC), Hebrew University of Jerusalem
Yossi Adi
The Hebrew University of Jerusalem
Machine Learning · AI · Spoken Language Modeling · Audio Speech and Language Processing