🤖 AI Summary
Existing multimodal empathic response generation systems struggle to simultaneously achieve cross-modal affective understanding and speaker identity consistency. This paper proposes an explicit emotion-driven three-stage architecture—empathic understanding, memory retrieval, and response generation—that unifies textual, acoustic, and visual affective signals within a multimodal large language model (MLLM), enabling zero-shot and few-shot cross-modal affective alignment and identity preservation without additional fine-tuning. The system integrates an MLLM, expressive text-to-speech synthesis, and video generation modules in an end-to-end fashion, significantly enhancing the emotional accuracy and naturalness of empathic responses. Evaluated on the ACM MM 2025 Avatar-based Multimodal Empathy Challenge, the approach ranks first, demonstrating state-of-the-art performance and practical efficacy in low-resource settings.
📝 Abstract
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation system based on multimodal LLMs, which decomposes the MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system in both zero-shot and few-shot settings, securing the Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
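The three-stage decomposition in the abstract (understanding → memory retrieval → response generation) can be sketched as a simple pipeline. This is a minimal illustrative mock-up, not the authors' implementation: every class, function, and heuristic below (`EmpathyState`, `understand`, the keyword-based emotion detector, the memory bank) is an assumption for demonstration, and the real system would delegate each stage to an MLLM, expressive TTS, and video generation models.

```python
# Hypothetical sketch of a three-stage MERG pipeline as described in the
# E3RG abstract. All names and heuristics here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class EmpathyState:
    """Carries the detected emotion and retrieved memories between stages."""
    emotion: str
    memories: list = field(default_factory=list)


def understand(dialogue, audio=None, video=None):
    """Stage 1: multimodal empathy understanding.

    Stubbed with a keyword heuristic; the real system would fuse text,
    acoustic, and visual cues through an MLLM.
    """
    text = " ".join(dialogue).lower()
    emotion = "sad" if "lost" in text else "neutral"
    return EmpathyState(emotion=emotion)


def retrieve_memory(state, memory_bank):
    """Stage 2: empathy memory retrieval — select entries matching the emotion."""
    state.memories = [m for m in memory_bank if m["emotion"] == state.emotion]
    return state


def generate_response(state):
    """Stage 3: multimodal response generation.

    Text-only in this sketch; the paper additionally drives expressive
    speech synthesis and identity-consistent video generation.
    """
    prefix = "I'm so sorry to hear that." if state.emotion == "sad" else "I see."
    recall = f" {state.memories[0]['text']}" if state.memories else ""
    return prefix + recall


# Toy end-to-end run of the three stages.
memory_bank = [{"emotion": "sad", "text": "Losing something dear is hard."}]
state = retrieve_memory(understand(["I lost my dog yesterday."]), memory_bank)
reply = generate_response(state)
```

The point of the sketch is the explicit staging: emotion is inferred once, then threaded through retrieval and generation, which is what lets the downstream speech and video modules condition on a single, consistent affective signal.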