WhAM: Towards A Translative Model of Sperm Whale Vocalization

📅 2025-12-01
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses the need for high-fidelity, biologically consistent synthetic sperm whale click sequences (codas) to advance modeling of their social communication. Method: the authors propose the first Transformer-based generative framework tailored to non-human vocalizations, built on VampNet, a masked acoustic token model pretrained on music. Using transfer learning and iterative masked token prediction, the model synthesizes codas from arbitrary audio prompts; it is fine-tuned on 10,000 field-recorded codas collected over two decades. Contribution/Results: generated codas significantly outperform baselines in expert perceptual ratings and Fréchet Audio Distance, and the model's learned representations perform strongly on rhythm structure identification, social unit attribution, and vowel classification. This work pioneers deep generative modeling for cetacean acoustic synthesis, establishing a new paradigm for bioacoustic research and conservation.

📝 Abstract
Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
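The abstract evaluates synthetic codas with Fréchet Audio Distance, which fits a Gaussian to embeddings of real clips and another to embeddings of generated clips, then measures the Fréchet distance between the two. A minimal sketch of that computation follows; the embedding network that produces the vectors is left abstract here, since the paper's specific choice is not stated in this listing:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_real, emb_fake: (n_clips, dim) arrays of audio embeddings,
    e.g. from a pretrained audio classifier (the embedding model is
    an assumption of this sketch, not specified by the source).
    """
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    diff = mu_r - mu_f
    # matrix square root of the covariance product; tiny imaginary
    # parts can appear from numerical error, so keep the real part
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Identical embedding sets give a distance near zero; the score grows as the generated distribution drifts from the real one.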
Problem

Research questions and friction points this paper is trying to address.

How to generate high-fidelity synthetic sperm whale codas from arbitrary audio prompts
How to evaluate synthetic codas using both acoustic metrics and expert perceptual studies
Whether representations learned for generation transfer to downstream classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model for sperm whale vocalization generation
Finetuned VampNet on 10k coda recordings
Generates synthetic codas via masked token prediction
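The generation strategy named above, masked token prediction, works by starting from a fully masked token sequence and repeatedly (1) predicting all masked positions, (2) keeping the most confident predictions, and (3) re-masking the rest on a decreasing schedule. The toy sketch below illustrates the loop; `predict_fn`, the mask schedule, and the vocabulary are stand-ins for illustration, not WhAM's actual components:

```python
import numpy as np

MASK = -1  # sentinel id for masked positions

def iterative_masked_decode(predict_fn, seq_len: int, n_steps: int = 8) -> np.ndarray:
    """Toy sketch of iterative masked token decoding (VampNet-style).

    predict_fn(tokens) -> (pred_ids, confidence), one entry per position;
    it stands in for the transformer, which in a real model predicts a
    distribution over neural-codec token ids.
    """
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(1, n_steps + 1):
        pred_ids, conf = predict_fn(tokens)
        masked = tokens == MASK
        # tentatively fill every masked slot with the model's prediction
        tokens = np.where(masked, pred_ids, tokens)
        if step == n_steps:
            break  # last pass: accept everything
        # cosine schedule: how many positions stay masked for the next pass
        n_mask = int(np.ceil(seq_len * np.cos(np.pi / 2 * step / n_steps)))
        # re-mask the least confident of the newly filled positions;
        # already-accepted tokens get infinite confidence so they stay fixed
        conf = np.where(masked, conf, np.inf)
        remask = np.argsort(conf)[:n_mask]
        tokens[remask] = MASK
    return tokens
```

Each pass commits a growing prefix of high-confidence tokens, so later passes condition on earlier decisions without a strict left-to-right order, which is what distinguishes this scheme from autoregressive decoding.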
Authors
Orr Paradise, EPFL (Theoretical Computer Science, Machine Learning)
Pranav Muralikrishnan, UC Berkeley
Liangyuan Chen, UC Berkeley
H. F. García, Northwestern University
Bryan Pardo, Computer Science, Northwestern University (Audio, Music, Machine Learning, HCI, Music Information Retrieval)
Roee Diamant, Haifa University
David F. Gruber, City University of New York
Shane Gero, Carleton University
Shafi Goldwasser, UC Berkeley