🤖 AI Summary
Existing end-to-end spoken dialogue systems face three key bottlenecks in empathetic interaction: heavy reliance on large-scale dialogue data, inadequate modeling of paralinguistic cues (e.g., emotion, gender, age), and the absence of dedicated empathetic speech datasets and evaluation benchmarks. To address these challenges, we propose OSUM-EChat, an open-source end-to-end spoken dialogue system built on a large speech understanding model. It combines a three-stage understanding-driven training strategy with a linguistic-paralinguistic dual thinking mechanism, which integrates paralinguistic understanding, via a chain of thought, with dialogue generation, enabling empathetic responses in resource-limited settings. We also introduce EChat-200K, a corpus of empathetic speech-to-speech dialogues, and its corresponding evaluation benchmark, EChat-eval. Experiments show that OSUM-EChat outperforms existing end-to-end spoken dialogue models in empathetic responsiveness.
📝 Abstract
Empathy is crucial for natural interaction in spoken dialogue systems, allowing machines to recognize and respond appropriately to paralinguistic cues such as age, gender, and emotion. Recent advances in end-to-end speech language models, which unify speech understanding and generation, provide promising solutions. However, several challenges persist, including an over-reliance on large-scale dialogue datasets, insufficient extraction of the paralinguistic cues vital for conveying empathy, and the lack of empathy-specific datasets and evaluation frameworks. To address these issues, we introduce OSUM-EChat, an open-source, end-to-end spoken dialogue system designed to enhance empathetic interactions, particularly in resource-limited settings. OSUM-EChat makes two key contributions: (1) a three-stage understanding-driven spoken dialogue training strategy that extends the capabilities of a large speech understanding model to spoken dialogue tasks, and (2) a linguistic-paralinguistic dual thinking mechanism that integrates paralinguistic understanding, via a chain of thought, with dialogue generation, enabling the system to produce more empathetic responses. This approach reduces reliance on large-scale dialogue datasets while maintaining high-quality empathetic interactions. Additionally, we introduce the EChat-200K dataset, a rich corpus of empathetic speech-to-speech dialogues, and the EChat-eval benchmark, a comprehensive framework for evaluating the empathetic capabilities of dialogue systems. Experimental results demonstrate that OSUM-EChat outperforms existing end-to-end spoken dialogue models in empathetic responsiveness, validating its effectiveness.