Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic evaluation of multimodal large language models (MLLMs) on fine-grained emotion understanding in open-vocabulary multimodal emotion recognition (MER-OV). Method: We introduce the first large-scale MER-OV benchmark—built upon the OV-MERD dataset—and comprehensively evaluate 19 state-of-the-art MLLMs across audio, video, and text modalities. We propose a multidimensional evaluation paradigm covering reasoning analysis, modality fusion, context utilization, and prompt engineering. Contribution/Results: Our study reveals that two-stage trimodal fusion is optimal, with video contributing most to performance; open- and closed-source MLLMs exhibit negligible performance gaps. Our framework achieves new state-of-the-art results on MER-OV. We publicly release code, models, and practical guidelines to advance interpretable, fine-grained affective AI.
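
To make the reported recipe concrete, here is a minimal sketch of what a two-stage trimodal fusion pipeline could look like: stage one asks a modality-capable MLLM to describe the emotional cues in each stream, and stage two has a text-only LLM merge those descriptions into free-form labels. The helper `query_mllm` and the prompt wording are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of two-stage trimodal fusion for MER-OV.
# `query_mllm` stands in for whatever MLLM backend is used; it is NOT
# the paper's API, just a placeholder so the control flow is complete.
from typing import Dict, List

def query_mllm(prompt: str, attachment: str | None = None) -> str:
    raise NotImplementedError("plug in an actual MLLM call here")

def stage1_describe(clip: str, modality: str) -> str:
    """Stage 1: per-modality description of emotion-relevant cues."""
    prompt = (f"Describe the emotional cues in the {modality} of this clip "
              f"(tone, facial expression, wording, intensity).")
    return query_mllm(prompt, attachment=clip)

def stage2_fuse(desc: Dict[str, str], context: str = "") -> List[str]:
    """Stage 2: a text LLM fuses the three descriptions (plus optional
    conversational context) into open-vocabulary emotion labels."""
    prompt = ("List, as comma-separated free-form words, every emotion "
              "supported by these observations:\n"
              f"Audio: {desc['audio']}\nVideo: {desc['video']}\n"
              f"Transcript: {desc['text']}\nContext: {context}")
    return [w.strip().lower() for w in query_mllm(prompt).split(",") if w.strip()]
```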

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable multi- and cross-modal integration capabilities. However, their potential for fine-grained emotion understanding remains systematically underexplored. While open-vocabulary multimodal emotion recognition (MER-OV) has emerged as a promising direction to overcome the limitations of closed emotion sets, no comprehensive evaluation of MLLMs in this context currently exists. To address this, our work presents the first large-scale benchmarking study of MER-OV on the OV-MERD dataset, evaluating 19 mainstream MLLMs, including general-purpose, modality-specialized, and reasoning-enhanced architectures. Through systematic analysis of model reasoning capacity, fusion strategies, contextual utilization, and prompt design, we provide key insights into the capabilities and limitations of current MLLMs for MER-OV. Our evaluation reveals that a two-stage, trimodal (audio, video, and text) fusion approach achieves optimal performance in MER-OV, with video emerging as the most critical modality. We further identify a surprisingly narrow gap between open- and closed-source LLMs. These findings establish essential benchmarks and offer practical guidelines for advancing open-vocabulary and fine-grained affective computing, paving the way for more nuanced and interpretable emotion AI systems. Associated code will be made publicly available upon acceptance.
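
"Open vocabulary" also changes how predictions are scored: with no fixed label set, predicted and reference emotion words have to be compared as sets, typically after grouping near-synonyms. The sketch below shows that set-level precision/recall idea; the tiny synonym table is a toy assumption for illustration, not the benchmark's actual grouping procedure (which the OV-MERD line of work reportedly delegates to an LLM).

```python
# Sketch of set-level scoring for open-vocabulary emotion labels.
# SYNONYM_GROUPS is a toy stand-in: the real benchmark's grouping of
# near-synonymous words is more involved and is an assumption here.
SYNONYM_GROUPS = {
    "happy": "joy", "joyful": "joy",
    "angry": "anger", "irritated": "anger",
    "sad": "sadness", "sorrowful": "sadness",
}

def normalize(labels: list[str]) -> set[str]:
    """Map free-form labels to synonym groups; unknown words pass through."""
    return {SYNONYM_GROUPS.get(w.strip().lower(), w.strip().lower()) for w in labels}

def set_scores(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Precision and recall over the normalized label sets."""
    pred, ref = normalize(predicted), normalize(reference)
    hit = len(pred & ref)
    return (hit / len(pred) if pred else 0.0,
            hit / len(ref) if ref else 0.0)

print(set_scores(["joyful", "curious"], ["happy"]))  # -> (0.5, 1.0)
```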
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multimodal large language models (MLLMs) for open-vocabulary emotion recognition
Evaluating fine-grained emotion understanding across diverse model architectures
Identifying optimal fusion strategies for audio, video, and text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage trimodal fusion for emotion recognition
Benchmarking 19 MLLMs on open-vocabulary emotion dataset
Video identified as the most critical modality
Jing Han
University of Cambridge
deep learning, audio signal processing, machine learning, mHealth, affective computing
Zhiqiang Gao
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Shihao Gao
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Jialing Liu
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Hongyu Chen
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Zixing Zhang
Professor, Hunan University
Artificial Intelligence, Speech Processing, Affective Computing, Digital Health, Automatic Speech Recognition
Björn W. Schuller
GLAM – the Group on Language, Audio, and Music, Imperial College London, SW7 2BX London, U.K., and also with CHI – the Chair of Health Informatics at TUM University Hospital, the MCML – Munich Center for Machine Learning, and the MDSI – Munich Data Science Institute, all in Munich, Germany