🤖 AI Summary
This study investigates whether the sequential vocalizations of animals exhibit human-like hierarchical long-range dependencies, and whether Transformer models can effectively capture such patterns in biological communication. To this end, we introduce FinchGPT, the first Transformer-based model specifically designed for birdsong, trained on text-encoded Bengalese finch vocalizations. Adapting the large language model paradigm to ethology, we conduct dual-path validation: computational analysis of attention weights and neurobiological validation via targeted brain-lesion experiments. Our findings demonstrate: (1) syntax-like dependencies spanning dozens of syllables in song sequences; (2) FinchGPT's superior performance over RNN and CNN baselines; and (3) strong alignment between FinchGPT's attention mechanisms and neural contributions of songbird brain regions implicated in syntactic processing. These results provide convergent computational and neurobiological evidence for structural complexity in animal communication and support hierarchical computation in the avian brain.
📝 Abstract
Long-range dependencies among tokens, which originate from hierarchical structure, are a defining hallmark of human language. Whether similar dependencies exist within the sequential vocalizations of non-human animals, however, remains an open question. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of the Bengalese finch (Lonchura striata domestica), which are characterized by highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsong, which outperformed models of other architectures in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllable sequences. Furthermore, reverse-engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT's attention span, and disrupting birdsong syntax through the ablation of specific brain nuclei, each markedly altered the model's outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
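The kind of attention-weight analysis described above can be illustrated with a minimal toy sketch (this is not FinchGPT's actual implementation, and the sequence length, embedding size, and the `mean_attended_distance` diagnostic are illustrative assumptions): compute scaled dot-product attention weights over a token sequence, then measure how far back each position attends, a simple proxy for long-range dependency capture.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)      # each row sums to 1

def mean_attended_distance(W):
    """Average token distance |i - j|, weighted by attention, per query."""
    n = W.shape[0]
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return (W * dist).sum(axis=-1)

# Toy stand-in for a syllable sequence: 32 tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
n, d = 32, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

W = attention_weights(W := None or Q, K) if False else attention_weights(Q, K)
print(W.shape)  # (32, 32)
print(mean_attended_distance(W).mean())
```

In a trained model, rows of `W` whose weight mass sits far from the diagonal indicate long-range dependencies; restricting the attention span (as in the study's manipulation) would zero out entries beyond a fixed distance before the softmax.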