E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal empathetic response generation systems struggle to achieve cross-modal affective understanding and speaker identity consistency at the same time. This paper proposes an explicit emotion-driven three-stage architecture (empathy understanding, memory retrieval, and response generation) that unifies textual, acoustic, and visual affective signals within a multimodal large language model (MLLM), enabling zero-shot and few-shot cross-modal affective alignment and identity preservation without additional fine-tuning. The system integrates an MLLM, expressive text-to-speech synthesis, and video generation modules end to end, significantly improving the emotional accuracy and naturalness of empathetic responses. Evaluated on the ACM MM 2025 Avatar-based Multimodal Empathy Challenge, the approach ranks first, demonstrating state-of-the-art performance and practical efficacy in low-resource settings.

📝 Abstract
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs, which decomposes the MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system in both zero-shot and few-shot settings, securing the Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM 2025. Our code is available at https://github.com/RH-Lin/E3RG.
Problem

Research questions and friction points this paper is trying to address.

Handling multimodal emotional content in empathetic response generation
Maintaining identity consistency in emotionally intelligent interactions
Decomposing MERG into understanding, memory retrieval, and response generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM for empathetic response generation
Decomposes task into understanding, memory, generation
Integrates speech and video models without retraining
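
The three-stage decomposition described above can be sketched as a simple pipeline. This is an illustrative mock-up only: the function names, the keyword-based emotion rule, and the placeholder speech/video outputs are assumptions standing in for the paper's actual MLLM, expressive TTS, and video generation components.

```python
# Hypothetical sketch of E3RG's three-stage decomposition.
# All names and interfaces are illustrative, not the authors' code.
from dataclasses import dataclass, field

@dataclass
class EmpathyState:
    emotion: str                      # explicit emotion label from stage 1
    memories: list = field(default_factory=list)

def understand(dialogue: list) -> EmpathyState:
    """Stage 1: multimodal empathy understanding.
    A toy keyword rule stands in for the MLLM's emotion inference."""
    last = dialogue[-1].lower()
    emotion = "sad" if "lost" in last else "neutral"
    return EmpathyState(emotion=emotion)

def retrieve(state: EmpathyState, memory_bank: dict) -> EmpathyState:
    """Stage 2: empathy memory retrieval, keyed on the explicit emotion."""
    state.memories = memory_bank.get(state.emotion, [])
    return state

def generate(state: EmpathyState) -> dict:
    """Stage 3: multimodal response generation. The real system would call
    an MLLM, expressive TTS, and a video generator; placeholders here."""
    text = f"[{state.emotion}] empathetic reply using {len(state.memories)} memories"
    return {"text": text, "speech": "<tts-audio>", "video": "<talking-head>"}

def e3rg_pipeline(dialogue, memory_bank):
    return generate(retrieve(understand(dialogue), memory_bank))

out = e3rg_pipeline(["I lost my cat today."], {"sad": ["comforting anecdote"]})
print(out["text"])
```

The point of the decomposition is that each stage is swappable: because the emotion label is made explicit between stages, the speech and video generators can condition on it directly without any joint retraining.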
👥 Authors

Ronghao Lin
University of Science and Technology of China
Waveform Design, Sparse Array Design, Statistical Signal Processing, Optimization Theory

Shuai Shen
Nanyang Technological University
Computer Vision, Visual Generation

Weipeng Hu
Nanyang Technological University

Qiaolin He
Sun Yat-sen University

Aolin Xiong
Sun Yat-sen University

Li Huang
Desay SV Automotive Co., Ltd

Haifeng Hu
Sun Yat-sen University

Yap-peng Tan
Nanyang Technological University