🤖 AI Summary
Current purely text-based large language models (LLMs) face fundamental limitations in studying key linguistic phenomena, such as speech perception, prosody, dialectal variation, and child language acquisition, because they cannot model the speech modality. This paper systematically establishes, for the first time, the indispensable role of audio-driven deep models in linguistic theory construction and proposes a novel bidirectional "technical model–linguistic interpretation" co-development paradigm. Methodologically, we integrate speech representation learning, self-supervised audio models (e.g., wav2vec 2.0, Whisper), neurobehavioral modeling, and cross-modal interpretability analysis. Our core contribution is the formalization of speech deep models as foundational infrastructure for language-cognition modeling, thereby enabling deep methodological and theoretical integration between computational and experimental linguistics and providing essential support for embodied, multimodal theories of language cognition.
📝 Abstract
Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions about human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.