EmoBench-Reddit: A Hierarchical Benchmark for Evaluating the Emotional Intelligence of Multimodal Large Language Models

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks predominantly focus on objective visual question answering or caption generation, neglecting the evaluation of complex, subjective affective understanding such as humor, irony, and sorrow. To address this gap, we propose EmoBench-Reddit, the first Reddit-based hierarchical multimodal benchmark for affective intelligence, structured along a perception-to-cognition progression that integrates visual, linguistic, and contextual reasoning. Leveraging Claude 4-assisted annotation with rigorous human validation, we construct a high-quality dataset of 350 instances, enabling systematic assessment across emotion recognition, scene reasoning, intent inference, and empathic response generation. The benchmark grounds fine-grained affective annotations in real-world social media posts and employs a hybrid question format, combining six multiple-choice questions with one open-ended question per sample, to enhance the ecological validity of affective understanding evaluation.

📝 Abstract
Multimodal Large Language Models (MLLMs) have advanced rapidly and demonstrated exceptional capabilities across a variety of vision-language tasks. However, current evaluation benchmarks predominantly focus on objective visual question answering or captioning, inadequately assessing models' ability to understand complex and subjective human emotions. To bridge this gap, we introduce EmoBench-Reddit, a novel hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit, each containing an image, associated user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. We designed a hierarchical task framework that progresses from basic perception to advanced cognition, with each data point featuring six multiple-choice questions and one open-ended question of increasing difficulty. Perception tasks evaluate the model's ability to identify basic visual elements (e.g., colors, objects), while cognition tasks require scene reasoning, intent understanding, and deep empathy integrating textual context. We ensured annotation quality through a combination of AI assistance (Claude 4) and manual verification.
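
To make the per-sample structure concrete, here is a minimal sketch of one benchmark sample in Python. The field names and types are assumptions for illustration; the paper does not publish its data format.

```python
from dataclasses import dataclass, field

@dataclass
class MCQuestion:
    prompt: str
    options: list[str]   # candidate answers for the multiple-choice question
    answer_index: int    # index of the correct option
    tier: str            # "perception" or "cognition" (hypothetical labels)

@dataclass
class EmoBenchSample:
    image_path: str      # the Reddit post's image
    user_text: str       # user-provided text accompanying the image
    emotion: str         # one of "sad", "humor", "sarcasm", "happy" (from user flairs)
    mc_questions: list[MCQuestion] = field(default_factory=list)  # six per sample, increasing difficulty
    open_ended_prompt: str = ""  # the single open-ended empathy question
```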
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal models' emotional intelligence capabilities
Evaluating understanding of complex subjective human emotions
Bridging gap in emotion-focused multimodal evaluation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark for multimodal emotion evaluation (see the scoring sketch after this list)
Reddit dataset with image-text-emotion samples
AI-assisted and manual annotation quality assurance
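
To illustrate how the hierarchical design could translate into scores, here is a minimal evaluation loop over the hypothetical schema sketched above. `ask_model` is a caller-supplied, hypothetical function, and per-tier multiple-choice accuracy is an assumed metric; the paper does not specify its exact scoring protocol.

```python
from collections import defaultdict

def evaluate(samples, ask_model):
    """Compute multiple-choice accuracy per task tier (perception vs. cognition).

    ask_model(image_path, user_text, prompt, options) is assumed to return
    the index of the option the model selects.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        for q in sample.mc_questions:
            pred = ask_model(sample.image_path, sample.user_text, q.prompt, q.options)
            total[q.tier] += 1
            correct[q.tier] += int(pred == q.answer_index)
    return {tier: correct[tier] / total[tier] for tier in total}
```

The open-ended empathy responses would need a separate rubric (e.g., human or LLM-as-judge rating), which this sketch omits.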