🤖 AI Summary
Existing end-to-end spoken language models treat speech as a plain text carrier, neglecting paralinguistic and speaker-specific attributes—such as dialect, age, emotion, and non-linguistic vocalizations—thereby limiting natural, human-like interaction.
Method: The paper proposes GOAT-SLM, a paralinguistically and speaker-aware spoken language model featuring a dual-modality head architecture that decouples linguistic modeling from acoustic realization; a modular, stage-wise training strategy enabling joint learning of semantic and non-semantic information; and integration of end-to-end modeling, multi-task learning, and cross-modal alignment on large-scale speech-text corpora.
Results: Evaluated on the TELEVAL benchmark, GOAT-SLM achieves balanced, state-of-the-art performance across both semantic understanding and non-semantic tasks—including emotion recognition, dialect adaptation, and age-sensitive interaction—significantly outperforming open-source baselines. Per the authors, this is the first work to systematically realize joint modeling and generation of multidimensional paralinguistic cues in spontaneous speech.
📝 Abstract
Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker-characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker-characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker-characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
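The core architectural idea above — a dual-modality head that decouples linguistic modeling from acoustic realization — can be sketched as a shared backbone state feeding two independent output projections. This is a minimal illustrative sketch, not the paper's implementation: all names, dimensions, and vocabulary sizes here are assumptions.

```python
import numpy as np

# Sketch of a dual-modality output head (illustrative assumptions only):
# one hidden state from the language backbone is projected into two
# separate distributions -- text tokens (linguistic content) and discrete
# acoustic tokens (expressive/paralinguistic realization).

rng = np.random.default_rng(0)

HIDDEN = 64           # assumed backbone hidden size
TEXT_VOCAB = 1000     # assumed text-token vocabulary size
ACOUSTIC_VOCAB = 512  # assumed acoustic-token codebook size

# Decoupling: each head has its own parameters, so acoustic realization
# does not constrain the linguistic distribution (and vice versa).
W_text = rng.standard_normal((HIDDEN, TEXT_VOCAB)) * 0.02
W_acoustic = rng.standard_normal((HIDDEN, ACOUSTIC_VOCAB)) * 0.02

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_head(hidden_state):
    """Map one backbone hidden state to two token distributions."""
    text_probs = softmax(hidden_state @ W_text)
    acoustic_probs = softmax(hidden_state @ W_acoustic)
    return text_probs, acoustic_probs

h = rng.standard_normal(HIDDEN)
text_probs, acoustic_probs = dual_head(h)
print(text_probs.shape, acoustic_probs.shape)  # (1000,) (512,)
```

The design point this illustrates is that language understanding can be trained and evaluated on the text head alone, while the acoustic head is free to adapt prosody, emotion, or dialect without perturbing the linguistic distribution.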