MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack a standardized benchmark for multi-object sentiment analysis. To address this gap, the authors introduce MOSABench, the first benchmark dedicated to this task, comprising approximately 1,000 real-world images that each contain multiple objects and require independent, object-level sentiment classification. MOSABench incorporates distance-based target annotation, post-processing that standardizes model outputs for evaluation, and a distance-weighted scoring mechanism that quantifies how the spatial distribution of objects affects sentiment understanding. Experiments reveal notable limitations in current MLLMs: while some models, such as mPLUG-owl and Qwen-VL2, attend effectively to sentiment-relevant features, others show scattered focus, and performance degrades sharply as the spatial distance between objects increases. MOSABench thus serves as a reproducible, diagnostic benchmark for improving MLLMs on complex multi-object sentiment analysis.

📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite these advancements, there remains a lack of standardized benchmarks for evaluating MLLMs' performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmarks for multi-object sentiment analysis
Need to evaluate MLLMs on complex images with multiple objects
Current models' performance degrades as the spatial distance between objects increases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distance-based target annotation for object sentiment
Post-processing to standardize model output evaluation
Improved scoring mechanism for multi-object sentiment analysis
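The distance-based annotation and improved scoring mechanism listed above can be sketched as follows. The paper's exact formula is not reproduced here, so the weighting scheme, the `ObjectAnnotation` fields, and the `distance_weighted_accuracy` function are illustrative assumptions only: the idea is that correct predictions on spatially distant objects (the harder cases) earn more credit than predictions on tightly clustered ones.

```python
# Hypothetical sketch of distance-weighted scoring for multi-object
# sentiment analysis; NOT the formulation used by MOSABench itself.
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    label: str        # gold sentiment: "positive" | "neutral" | "negative"
    distance: float   # assumed normalized distance to other objects, in [0, 1]

def distance_weighted_accuracy(annotations, predictions):
    """Weight each object's correctness by its (assumed) distance score."""
    total_weight = 0.0
    earned = 0.0
    for ann, pred in zip(annotations, predictions):
        weight = 1.0 + ann.distance   # far-apart objects count more
        total_weight += weight
        if pred == ann.label:
            earned += weight
    return earned / total_weight if total_weight else 0.0

# Example: two objects; the distant one is misclassified, so the
# penalty is larger than a plain accuracy score would reflect.
gold = [ObjectAnnotation("positive", 0.1), ObjectAnnotation("negative", 0.9)]
pred = ["positive", "neutral"]
score = distance_weighted_accuracy(gold, pred)  # 1.1 / 3.0 ≈ 0.367
```

Under this toy weighting, unweighted accuracy would be 0.5, but the distance-weighted score drops to about 0.37 because the error falls on the more distant object.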
👥 Authors
Shezheng Song (NUDT)
Chengxiang He (School of Computer and Information Engineering, Hefei University of Technology, Hefei 230009, China)
Shan Zhao (School of Computer and Information Engineering, Hefei University of Technology, Hefei 230009, China)
Chengyu Wang (Alibaba Group)
Qian Wan (CCNU, Wuhan, China)
Tianwei Yan
Meng Wang (School of Computer and Information Engineering, Hefei University of Technology, Hefei 230009, China)
Xiaopeng Li
Shasha Li
Jun Ma
Jie Yu
Xiaoguang Mao