Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal Emotion Recognition in Conversations (MERC) aims to enable fine-grained, natural emotion understanding in human-computer interaction by fusing textual, acoustic, and visual modalities. However, unimodal approaches fall short in dynamic conversational settings, where emotional cues are spread asynchronously across modalities and depend heavily on context. This paper presents the first systematic survey of MERC research: it clarifies task formulations, evaluation benchmarks, and historical evolution; establishes a structured taxonomy covering feature- and decision-level fusion, attention mechanisms, graph neural networks, cross-modal contrastive learning, and multi-task learning; identifies core challenges, including cross-modal alignment and dynamic contextual modeling; and analyzes current performance bottlenecks and benchmark gaps. The survey provides a rigorous theoretical framework and an actionable technical roadmap for advancing emotion-aware dialogue systems.
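
The taxonomy above rests on the distinction between feature-level (early) and decision-level (late) fusion. As a minimal illustrative sketch, and not code from the paper, the PyTorch snippet below contrasts the two strategies for classifying a single utterance; the feature dimensions, hidden size, and six-class emotion label set are assumptions chosen only for the example.

```python
# Illustrative sketch of the two fusion levels named in the survey's taxonomy.
# Feature dimensions and the 6-class label set are assumptions for the example.
import torch
import torch.nn as nn

NUM_EMOTIONS = 6  # assumed label set size

class FeatureLevelFusion(nn.Module):
    """Early fusion: concatenate text/audio/visual utterance features, then classify."""
    def __init__(self, d_text=768, d_audio=128, d_visual=512, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_text + d_audio + d_visual, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_EMOTIONS),
        )

    def forward(self, text, audio, visual):
        fused = torch.cat([text, audio, visual], dim=-1)  # (batch, d_text + d_audio + d_visual)
        return self.classifier(fused)

class DecisionLevelFusion(nn.Module):
    """Late fusion: classify each modality separately, then average the logits."""
    def __init__(self, d_text=768, d_audio=128, d_visual=512):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d, NUM_EMOTIONS) for d in (d_text, d_audio, d_visual)]
        )

    def forward(self, text, audio, visual):
        logits = [head(x) for head, x in zip(self.heads, (text, audio, visual))]
        return torch.stack(logits, dim=0).mean(dim=0)

if __name__ == "__main__":
    batch = 4
    text = torch.randn(batch, 768)    # e.g., sentence-encoder embeddings
    audio = torch.randn(batch, 128)   # e.g., prosodic/acoustic features
    visual = torch.randn(batch, 512)  # e.g., facial-expression features
    print(FeatureLevelFusion()(text, audio, visual).shape)   # torch.Size([4, 6])
    print(DecisionLevelFusion()(text, audio, visual).shape)  # torch.Size([4, 6])
```

In the feature-level variant the classifier can learn cross-modal interactions directly, while the decision-level variant keeps the modalities independent until the final vote, which is simpler but cannot model such interactions.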

📝 Abstract
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.
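
Beyond concatenation-style fusion, the methods surveyed also integrate modalities with attention. The snippet below is a minimal cross-modal attention sketch in which text tokens attend over audio frames; the dimensions, sequence lengths, and residual design are assumptions for illustration rather than the paper's implementation.

```python
# Illustrative cross-modal attention block (not from the paper): text-token queries
# attend over audio frames so acoustic cues refine the textual representation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text attends to audio; the attended audio context is added back to the text stream."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_seq, audio_seq):
        # text_seq: (batch, n_text_tokens, d_model); audio_seq: (batch, n_audio_frames, d_model)
        context, _ = self.attn(query=text_seq, key=audio_seq, value=audio_seq)
        return self.norm(text_seq + context)  # residual keeps the text stream primary

if __name__ == "__main__":
    text = torch.randn(2, 20, 256)   # 20 text tokens per utterance (assumed)
    audio = torch.randn(2, 50, 256)  # 50 acoustic frames per utterance (assumed)
    print(CrossModalAttention()(text, audio).shape)  # torch.Size([2, 20, 256])
```

A symmetric block attending from text to visual features, or between any other modality pair, would follow the same pattern.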
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotional understanding in human-computer interaction through multimodal integration
Surveying methods and challenges in Multimodal Emotion Recognition in Conversations
Providing guidance for future research in emotionally intelligent systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates text, speech, and visual signals
Systematic overview of MERC methods
Highlights trends, challenges, and future directions
Chengyan Wu
South China Normal University & Sun Yat-sen University
Machine Learning, NLP, Multilingual IE, LLM Alignment, Multimodal
Yiqiang Cai
Guangdong Provincial Key Laboratory of Quantum Engineering and Quantum Materials, School of Electronic Science and Engineering (School of Microelectronics), South China Normal University
Yang Liu
North Carolina Central University
Pengxu Zhu
Georgia Institute of Technology
Yun Xue
Guangdong Provincial Key Laboratory of Quantum Engineering and Quantum Materials, School of Electronic Science and Engineering (School of Microelectronics), South China Normal University
Ziwei Gong
Ph.D. candidate, Columbia University
NLP, Speech
Julia Hirschberg
Columbia University
Spoken Language Processing, Natural Language Processing, Prosody
Bolei Ma
LMU Munich
Linguistics, Natural Language Processing, Computational Social Science