🤖 AI Summary
Multimodal Emotion Recognition in Conversations (MERC) aims to enable fine-grained, natural emotion understanding in human–computer interaction by fusing textual, acoustic, and visual modalities. Existing unimodal approaches, however, cannot capture the complementary cues spread across modalities, while naive multimodal fusion struggles with the cross-modal asynchrony and context dependency inherent in dynamic conversational settings. This paper presents the first systematic survey of MERC research: it clarifies task formulations, evaluation benchmarks, and historical evolution; establishes a structured taxonomy encompassing feature- and decision-level fusion, attention mechanisms, graph neural networks, cross-modal contrastive learning, and multi-task learning; identifies core challenges, including cross-modal alignment and dynamic contextual modeling; and analyzes current performance bottlenecks and benchmark gaps. The survey provides a rigorous theoretical framework and an actionable technical roadmap for advancing emotion-aware dialogue systems.
📝 Abstract
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional intelligence of human–computer interaction. Its goal is to accurately recognize emotions by integrating information from modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, covering its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.