Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic evaluation of multimodal large language models (MLLMs) on fine-grained emotion understanding in open-vocabulary multimodal emotion recognition (MER-OV). Method: We introduce the first large-scale MER-OV benchmark—built upon the OV-MERD dataset—and comprehensively evaluate 19 state-of-the-art MLLMs across audio, video, and text modalities. We propose a multidimensional evaluation paradigm covering reasoning analysis, modality fusion, context utilization, and prompt engineering. Contribution/Results: Our study reveals that two-stage trimodal fusion is optimal, with video contributing most to performance; open- and closed-source MLLMs exhibit negligible performance gaps. Our framework achieves new state-of-the-art results on MER-OV. We publicly release code, models, and practical guidelines to advance interpretable, fine-grained affective AI.
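
To make the reported recipe concrete, here is a minimal sketch of what a two-stage trimodal fusion pipeline could look like: stage one asks a modality-capable MLLM to describe the emotional cues in each stream, and stage two has a text-only LLM merge those descriptions into free-form labels. The helper `query_mllm` and the prompt wording are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of two-stage trimodal fusion for MER-OV.
# `query_mllm` stands in for whatever MLLM backend is used; it is NOT
# the paper's API, just a placeholder so the control flow is complete.
from typing import Dict, List

def query_mllm(prompt: str, attachment: str | None = None) -> str:
    raise NotImplementedError("plug in an actual MLLM call here")

def stage1_describe(clip: str, modality: str) -> str:
    """Stage 1: per-modality description of emotion-relevant cues."""
    prompt = (f"Describe the emotional cues in the {modality} of this clip "
              f"(tone, facial expression, wording, intensity).")
    return query_mllm(prompt, attachment=clip)

def stage2_fuse(desc: Dict[str, str], context: str = "") -> List[str]:
    """Stage 2: a text LLM fuses the three descriptions (plus optional
    conversational context) into open-vocabulary emotion labels."""
    prompt = ("List, as comma-separated free-form words, every emotion "
              "supported by these observations:\n"
              f"Audio: {desc['audio']}\nVideo: {desc['video']}\n"
              f"Transcript: {desc['text']}\nContext: {context}")
    return [w.strip().lower() for w in query_mllm(prompt).split(",") if w.strip()]
```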

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable multi- and cross-modal integration capabilities. However, their potential for fine-grained emotion understanding remains systematically underexplored. While open-vocabulary multimodal emotion recognition (MER-OV) has emerged as a promising direction to overcome the limitations of closed emotion sets, no comprehensive evaluation of MLLMs in this context currently exists. To address this, our work presents the first large-scale benchmarking study of MER-OV on the OV-MERD dataset, evaluating 19 mainstream MLLMs, including general-purpose, modality-specialized, and reasoning-enhanced architectures. Through systematic analysis of model reasoning capacity, fusion strategies, contextual utilization, and prompt design, we provide key insights into the capabilities and limitations of current MLLMs for MER-OV. Our evaluation reveals that a two-stage, trimodal (audio, video, and text) fusion approach achieves optimal performance in MER-OV, with video emerging as the most critical modality. We further identify a surprisingly narrow gap between open- and closed-source LLMs. These findings establish essential benchmarks and offer practical guidelines for advancing open-vocabulary and fine-grained affective computing, paving the way for more nuanced and interpretable emotion AI systems. Associated code will be made publicly available upon acceptance.
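
"Open vocabulary" also changes how predictions are scored: with no fixed label set, predicted and reference emotion words have to be compared as sets, typically after grouping near-synonyms. The sketch below shows that set-level precision/recall idea; the tiny synonym table is a toy assumption for illustration, not the benchmark's actual grouping procedure (which the OV-MERD line of work reportedly delegates to an LLM).

```python
# Sketch of set-level scoring for open-vocabulary emotion labels.
# SYNONYM_GROUPS is a toy stand-in: the real benchmark's grouping of
# near-synonymous words is more involved and is an assumption here.
SYNONYM_GROUPS = {
    "happy": "joy", "joyful": "joy",
    "angry": "anger", "irritated": "anger",
    "sad": "sadness", "sorrowful": "sadness",
}

def normalize(labels: list[str]) -> set[str]:
    """Map free-form labels to synonym groups; unknown words pass through."""
    return {SYNONYM_GROUPS.get(w.strip().lower(), w.strip().lower()) for w in labels}

def set_scores(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Precision and recall over the normalized label sets."""
    pred, ref = normalize(predicted), normalize(reference)
    hit = len(pred & ref)
    return (hit / len(pred) if pred else 0.0,
            hit / len(ref) if ref else 0.0)

print(set_scores(["joyful", "curious"], ["happy"]))  # -> (0.5, 1.0)
```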
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multimodal large language models (MLLMs) for open-vocabulary emotion recognition
Evaluating fine-grained emotion understanding across diverse model architectures
Identifying optimal fusion strategies for audio, video, and text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage trimodal fusion for emotion recognition
Benchmarking 19 MLLMs on open-vocabulary emotion dataset
Video identified as the most critical modality
Jing Han
University of Cambridge
deep learning, audio signal processing, machine learning, mHealth, affective computing
Zhiqiang Gao
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Shihao Gao
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Jialing Liu
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Hongyu Chen
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Zixing Zhang
Professor, Hunan University
Artificial Intelligence, Speech Processing, Affective Computing, Digital Health, Automatic Speech Recognition
Björn W. Schuller
GLAM – the Group on Language, Audio, and Music, Imperial College London, SW7 2BX London, U.K., and also with CHI – the Chair of Health Informatics at TUM University Hospital, the MCML – Munich Center for Machine Learning, and the MDSI – Munich Data Science Institute, all in Munich, Germany