🤖 AI Summary
Existing multimodal empathic response generation systems struggle to simultaneously achieve cross-modal affective understanding and speaker identity consistency. This paper proposes an explicit emotion-driven three-stage architecture—empathic understanding, memory retrieval, and response generation—that unifies textual, acoustic, and visual affective signals within a multimodal large language model (MLLM), enabling zero-shot and few-shot cross-modal affective alignment and identity preservation without additional fine-tuning. The system integrates an MLLM, expressive text-to-speech synthesis, and video generation modules in an end-to-end fashion, significantly enhancing the emotional accuracy and naturalness of empathic responses. Evaluated on the ACM MM 2025 Avatar-based Multimodal Empathy Challenge, the approach ranks first, demonstrating state-of-the-art performance and practical efficacy in low-resource settings.
📝 Abstract
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation system based on multimodal LLMs, which decomposes the MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system in both zero-shot and few-shot settings, securing the Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
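The three-stage decomposition in the abstract (understanding → memory retrieval → response generation) can be sketched as a simple pipeline. This is a minimal illustrative mock-up, not the authors' implementation: every class, function, and heuristic below (`EmpathyState`, `understand`, the keyword-based emotion detector, the memory bank) is an assumption for demonstration, and the real system would delegate each stage to an MLLM, expressive TTS, and video generation models.

```python
# Hypothetical sketch of a three-stage MERG pipeline as described in the
# E3RG abstract. All names and heuristics here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class EmpathyState:
    """Carries the detected emotion and retrieved memories between stages."""
    emotion: str
    memories: list = field(default_factory=list)


def understand(dialogue, audio=None, video=None):
    """Stage 1: multimodal empathy understanding.

    Stubbed with a keyword heuristic; the real system would fuse text,
    acoustic, and visual cues through an MLLM.
    """
    text = " ".join(dialogue).lower()
    emotion = "sad" if "lost" in text else "neutral"
    return EmpathyState(emotion=emotion)


def retrieve_memory(state, memory_bank):
    """Stage 2: empathy memory retrieval — select entries matching the emotion."""
    state.memories = [m for m in memory_bank if m["emotion"] == state.emotion]
    return state


def generate_response(state):
    """Stage 3: multimodal response generation.

    Text-only in this sketch; the paper additionally drives expressive
    speech synthesis and identity-consistent video generation.
    """
    prefix = "I'm so sorry to hear that." if state.emotion == "sad" else "I see."
    recall = f" {state.memories[0]['text']}" if state.memories else ""
    return prefix + recall


# Toy end-to-end run of the three stages.
memory_bank = [{"emotion": "sad", "text": "Losing something dear is hard."}]
state = retrieve_memory(understand(["I lost my dog yesterday."]), memory_bank)
reply = generate_response(state)
```

The point of the sketch is the explicit staging: emotion is inferred once, then threaded through retrieval and generation, which is what lets the downstream speech and video modules condition on a single, consistent affective signal.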