Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

📅 2025-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models struggle to interpret prosody and affective cues in spoken dialogue, resulting in inadequate empathetic responses. This paper proposes Listen, Perceive, and Express (LPE), a two-stage framework and the first to integrate chain-of-thought (CoT) reasoning into speech-driven empathetic dialogue generation. In the first stage, speech-text multimodal alignment enables content understanding and emotion perception; in the second stage, CoT prompting guides empathetic response formulation. Crucially, LPE requires no paired speech question-answer annotations: unpaired speech utterances and corresponding text dialogues suffice for end-to-end empathetic response modeling. Evaluated on multiple spoken empathetic dialogue benchmarks, LPE significantly outperforms strong baselines; human evaluation shows a 32% improvement in empathetic quality, demonstrating both the efficacy and generalizability of CoT-guided empathetic generation without supervised speech QA data.

📝 Abstract
Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner and improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities, and has shown potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take spoken questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process: it first guides the LLM to listen to the content and perceive the emotional aspects of speech, and then uses Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the listened spoken content and the perceived emotional cues. Experiments demonstrate the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
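The second stage described above can be illustrated with a minimal sketch: given the "listened" speech content and the "perceived" emotion from stage one, a chain-of-thought prompt asks the model to reason about the speaker's feelings before replying. The function and template below are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of CoT-style prompt construction for empathetic reply
# generation, loosely following the Listen / Perceive / Express stages.
# The wording and function name are assumptions, not taken from the paper.

def build_cot_prompt(content: str, emotion: str) -> str:
    """Compose a CoT prompt from transcribed content and a perceived emotion label."""
    steps = [
        f'1. Listen: the speaker said: "{content}"',
        f"2. Perceive: the speaker's vocal emotion sounds {emotion}.",
        "3. Express: reason step by step about how the speaker feels,",
        "   then write a reply that acknowledges that feeling before",
        "   addressing the content.",
    ]
    return "\n".join(steps)

prompt = build_cot_prompt("I failed my driving test again.", "frustrated")
print(prompt)
```

In a full system, this prompt would be fed to the speech-aligned LLM from stage one; here it only shows how the two perceived signals (content and emotion) are combined into one reasoning chain.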
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Spoken Dialogue Processing
Emotional Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Listen, Perceive, and Express (LPE) Method
Chain-of-Thought (CoT) Prompting
Empathetic Response in Human-Computer Dialogue
Jingran Xie
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Shun Lei
PhD student, Tsinghua University
Speech synthesis · Music generation · Dance generation · Singing voice synthesis
Yue Yu
Pengcheng Laboratory, Shenzhen, China
Yang Xiang
Pengcheng Laboratory, Shenzhen, China
Hui Wang
Pengcheng Laboratory, Shenzhen, China
Xixin Wu
The Chinese University of Hong Kong
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; The Chinese University of Hong Kong, Hong Kong SAR, China