🤖 AI Summary
Marmoset vocalization data suffer from high noise levels, sparse annotations, and poor structural organization, hindering joint modeling of vocal segmentation, classification, and caller identification. Method: This work introduces the Transformer architecture to marmoset acoustic analysis for the first time, proposing an end-to-end multi-task learning model that takes spectrograms as input. Leveraging self-attention, the model explicitly captures long-range temporal dependencies and cross-vocal-unit relationships, overcoming CNNs' limitations in global structural modeling. Results: Evaluated on real-world low-resource marmoset recordings, the model jointly optimizes three tasks: vocal segmentation (F1 score), call-type classification (accuracy), and caller identification (caller-ID accuracy), with all three metrics significantly outperforming CNN baselines. This study establishes a scalable, Transformer-based acoustic analysis paradigm for investigating social communication and language development mechanisms in non-human primates.
Abstract
The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant language development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work using a CNN achieved joint call segmentation, classification, and caller identification for marmoset vocalizations. However, CNNs are limited in modeling long-range acoustic patterns. The Transformer architecture, which has been shown to outperform CNNs, uses self-attention to process information in parallel over long distances and to capture the global structure of marmoset vocalizations. We propose using the Transformer to jointly segment and classify marmoset calls and identify the caller of each vocalization.
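To make the architecture concrete, the following is a minimal NumPy sketch, not the paper's implementation, of the core idea: a single self-attention layer over spectrogram frames feeding three task heads (frame-level segmentation, utterance-level call-type classification, and caller identification). All dimensions, class counts, and weight initializations here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, F = 50, 64          # time frames, mel bins (hypothetical)
D = 32                 # model dimension (hypothetical)
N_CALL_TYPES = 10      # placeholder number of call types (e.g. phee, twitter, trill, ...)
N_CALLERS = 8          # placeholder number of individual marmosets

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Input: a log-mel spectrogram, linearly projected to the model dimension.
spec = rng.standard_normal((T, F))
W_in = rng.standard_normal((F, D)) / np.sqrt(F)
x = spec @ W_in                                   # (T, D)

# Single-head self-attention: every frame attends to every other frame.
# This global (T x T) interaction is what lets the model relate distant
# vocal units, in contrast to the local receptive fields of CNN kernels.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(D))              # (T, T) attention weights
h = attn @ v                                      # (T, D) contextual frames

# Three task heads sharing the same encoder output:
W_seg = rng.standard_normal((D, 2)) / np.sqrt(D)            # call / no-call per frame
W_cls = rng.standard_normal((D, N_CALL_TYPES)) / np.sqrt(D) # call type
W_id  = rng.standard_normal((D, N_CALLERS)) / np.sqrt(D)    # caller identity

seg_probs = softmax(h @ W_seg)        # (T, 2): frame-level segmentation
pooled = h.mean(axis=0)               # mean-pool frames for utterance-level tasks
call_probs = softmax(pooled @ W_cls)  # (N_CALL_TYPES,)
caller_probs = softmax(pooled @ W_id) # (N_CALLERS,)

print(seg_probs.shape, call_probs.shape, caller_probs.shape)
```

In a trained model, the three heads would be optimized jointly with a combined loss, so the shared self-attention encoder learns representations useful for all three tasks at once.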