🤖 AI Summary
Existing affective computing research is hindered by the scarcity of high-quality multimodal data, challenges in modeling fine-grained emotional semantics across modalities, and overly narrow evaluation paradigms. To address these limitations, we introduce MER, the first benchmark dedicated to emotion understanding for multimodal large language models (MLLMs). Our contributions include: (1) MER-Caption, a large-scale descriptive emotion dataset comprising 115K samples spanning 2,000+ fine-grained emotion categories, constructed via a model-assisted crowdsourcing annotation pipeline; (2) AffectGPT, a pre-fusion MLLM featuring explicit cross-modal alignment mechanisms; and (3) MER-UniBench, a unified evaluation framework supporting both free-form text generation and conventional classification tasks. Extensive experiments demonstrate that AffectGPT consistently outperforms state-of-the-art methods across diverse emotion understanding tasks. All code, models, and datasets are publicly released to advance emotion understanding from coarse-grained classification toward deep semantic generation.
📄 Abstract
The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level: from naive discriminative tasks to complex emotion understanding, built on advanced video understanding and natural language description. However, the community currently lacks large-scale datasets with dense, descriptive emotion annotations, as well as a multimodal-centric framework that maximizes the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Using our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date, featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored to both typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results demonstrate AffectGPT's robust performance across various MER tasks. We publicly release both the AffectGPT model and the MER-Caption dataset to foster further research and development in emotion understanding.