EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack systematic evaluation of image-evoked emotion understanding, and prevailing benchmarks suffer from coarse granularity and narrow task scope. Method: We introduce EEmo-Bench, the first benchmark dedicated to image-evoked emotion understanding, comprising four task categories (perception, fine-grained ranking, descriptive generation, and holistic assessment), augmented by pairwise image comparison. We propose a fine-grained emotion ranking annotation paradigm based on Valence-Arousal-Dominance (VAD) attributes, supported by 1,960 human-annotated images and 6,773 QA pairs. Additionally, we design single- and dual-image joint analysis tasks to enable multidimensional capability evaluation. Results: Empirical evaluation of 19 state-of-the-art MLLMs reveals significant deficiencies in fine-grained emotion discrimination and comparative reasoning. EEmo-Bench establishes a reproducible, extensible evaluation infrastructure for machine empathy modeling.

📝 Abstract
The flourishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the emotions evoked by images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the emotions evoked by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment of 19 commonly used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, their analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the way for further research aimed at enhancing MLLMs' comprehensive perception and understanding of image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
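The VAD-based ranking strategy described in the abstract can be illustrated with a minimal sketch: each emotion label is placed at a point in Valence-Arousal-Dominance space, and candidate emotions are ranked by their distance to an image's annotated VAD point. The emotion labels, their VAD coordinates, and the `rank_emotions` helper below are illustrative placeholders, not the paper's actual annotation data or method.

```python
import math

# Placeholder emotion catalog: (valence, arousal, dominance), each in [-1, 1].
# Values are for illustration only, not taken from the EEmo-Bench annotations.
EMOTION_VAD = {
    "joy":     (0.8, 0.6, 0.5),
    "fear":    (-0.6, 0.7, -0.6),
    "sadness": (-0.7, -0.3, -0.4),
    "calm":    (0.5, -0.6, 0.2),
}

def rank_emotions(image_vad, catalog=EMOTION_VAD):
    """Rank candidate emotions by Euclidean distance from an image's
    annotated (valence, arousal, dominance) point; closest first."""
    return sorted(catalog, key=lambda e: math.dist(image_vad, catalog[e]))

# A high-valence, moderately aroused image ranks positive emotions first.
print(rank_emotions((0.7, 0.5, 0.4)))
```

This kind of distance-based ordering is one simple way to turn continuous VAD annotations into the fine-grained emotion rankings the benchmark evaluates.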
Problem

Research questions and friction points this paper is trying to address.

Assessing image-evoked emotions with multi-modal large language models
Lack of systematic evaluation of emotion understanding in MLLMs
Developing a benchmark to enhance MLLMs' emotion perception capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion ranking strategy with VAD attributes
Four tasks for MLLM emotion evaluation
Image-pairwise analysis for joint comparison
🔎 Similar Papers
2024-05-14 · IEEE/RSJ International Conference on Intelligent Robots and Systems · Citations: 2
👥 Authors
Lancheng Gao, Shanghai Jiaotong University
Ziheng Jia, Shanghai Jiaotong University / Shanghai AILab (topics: LLM and LMM on Visual Quality Assessment)
Yunhao Zeng, Shanghai Jiaotong University, Shanghai, China
Wei Sun, Shanghai Jiaotong University, Shanghai, China
Yiming Zhang, Shanghai Jiaotong University, Shanghai, China
Wei Zhou, Cardiff University, Cardiff, UK
Guangtao Zhai, Professor, IEEE Fellow, Shanghai Jiao Tong University (topics: Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays)
Xiongkuo Min, Shanghai Jiaotong University, Shanghai, China