🤖 AI Summary
Existing end-to-end spoken language models treat speech as a plain text carrier, neglecting paralinguistic and speaker-specific attributes—such as dialect, age, emotion, and non-linguistic vocalizations—thereby limiting natural, human-like interaction.
Method: The paper proposes GOAT-SLM, a paralinguistically and speaker-aware spoken language model featuring a dual-modality head architecture that decouples linguistic modeling from acoustic realization; a modular, stage-wise training strategy enabling joint learning of semantic and non-semantic information; and integration of end-to-end modeling, multi-task learning, and cross-modal alignment on large-scale speech-text corpora.
Results: Evaluated on the TELEVAL benchmark, GOAT-SLM achieves balanced, state-of-the-art performance across both semantic understanding and non-semantic tasks—including emotion recognition, dialect adaptation, and age-sensitive interaction—significantly outperforming open-source baselines. Per the authors, this is the first work to systematically realize joint modeling and generation of multidimensional paralinguistic cues in spontaneous speech.
📝 Abstract
Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker-characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker-characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker-characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
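The core architectural idea above — a dual-modality head that decouples linguistic modeling from acoustic realization — can be sketched as a shared backbone state feeding two independent output projections. This is a minimal illustrative sketch, not the paper's implementation: all names, dimensions, and vocabulary sizes here are assumptions.

```python
import numpy as np

# Sketch of a dual-modality output head (illustrative assumptions only):
# one hidden state from the language backbone is projected into two
# separate distributions -- text tokens (linguistic content) and discrete
# acoustic tokens (expressive/paralinguistic realization).

rng = np.random.default_rng(0)

HIDDEN = 64           # assumed backbone hidden size
TEXT_VOCAB = 1000     # assumed text-token vocabulary size
ACOUSTIC_VOCAB = 512  # assumed acoustic-token codebook size

# Decoupling: each head has its own parameters, so acoustic realization
# does not constrain the linguistic distribution (and vice versa).
W_text = rng.standard_normal((HIDDEN, TEXT_VOCAB)) * 0.02
W_acoustic = rng.standard_normal((HIDDEN, ACOUSTIC_VOCAB)) * 0.02

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_head(hidden_state):
    """Map one backbone hidden state to two token distributions."""
    text_probs = softmax(hidden_state @ W_text)
    acoustic_probs = softmax(hidden_state @ W_acoustic)
    return text_probs, acoustic_probs

h = rng.standard_normal(HIDDEN)
text_probs, acoustic_probs = dual_head(h)
print(text_probs.shape, acoustic_probs.shape)  # (1000,) (512,)
```

The design point this illustrates is that language understanding can be trained and evaluated on the text head alone, while the acoustic head is free to adapt prosody, emotion, or dialect without perturbing the linguistic distribution.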