GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

📅 2025-07-24
🤖 AI Summary
Problem: Existing end-to-end spoken language models treat speech as a plain carrier of text, neglecting paralinguistic and speaker-specific attributes such as dialect, age, emotion, and non-linguistic vocalizations, which limits natural, human-like interaction.
Method: A paralinguistic- and speaker-aware spoken language model featuring (1) a dual-modality output architecture that decouples linguistic modeling from acoustic realization; (2) a modular, stage-wise training strategy enabling joint learning of semantic and non-semantic information; and (3) end-to-end modeling, multi-task learning, and cross-modal alignment on large-scale speech-text corpora.
Results: On the TELEVAL benchmark, the model achieves balanced, state-of-the-art performance across both semantic understanding and non-semantic tasks, including emotion recognition, dialect adaptation, and age-sensitive interaction, significantly outperforming open-source baselines. To the authors' knowledge, this is the first work to systematically realize joint modeling and generation of multidimensional paralinguistic cues in spontaneous speech.

📝 Abstract
Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
Problem

Research questions and friction points this paper is trying to address.

Enhances SLMs to capture paralinguistic cues like emotion and dialect
Decouples linguistic modeling from acoustic realization for adaptive speech
Improves handling of non-semantic tasks like age-sensitive interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-modality head decouples linguistic and acoustic modeling
Modular staged training aligns multi-level speech information
Balanced performance on semantic and non-semantic tasks
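The dual-modality head named above can be illustrated with a minimal sketch: a shared backbone hidden state feeds two independent projection heads, one over text tokens (linguistic content) and one over acoustic codec tokens (realization). This is an illustrative toy, not the paper's actual implementation; the dimensions, weight initialization, and head names are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not from the paper).
HIDDEN, TEXT_VOCAB, ACOUSTIC_VOCAB = 16, 100, 50

# Stand-in for the shared backbone's hidden state at one decoding step.
hidden = rng.standard_normal(HIDDEN)

# Two independent projection heads over the same hidden state:
# one predicts text tokens, the other speech codec tokens.
W_text = rng.standard_normal((TEXT_VOCAB, HIDDEN)) * 0.01
W_acoustic = rng.standard_normal((ACOUSTIC_VOCAB, HIDDEN)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

text_probs = softmax(W_text @ hidden)
acoustic_probs = softmax(W_acoustic @ hidden)

# Each head yields its own distribution; during training, gradients from
# both losses flow into the shared backbone, so "what to say" and
# "how it sounds" are learned jointly but predicted separately.
print(text_probs.shape, acoustic_probs.shape)
```

The decoupling means the linguistic head can be trained and evaluated on text-only objectives while the acoustic head adapts expressive attributes (emotion, dialect) without perturbing semantic predictions.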
Authors
Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He
Institute of Artificial Intelligence (TeleAI), China Telecom, China