II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 4
Influential: 2
🤖 AI Summary
This work addresses a critical deficiency of multimodal large language models (MLLMs): comprehending high-level implicit image semantics, such as underlying intent, abstract concepts, and sentiment polarity. It introduces II-Bench, the first dedicated benchmark for this capability. II-Bench comprises human-crafted, multidimensional image–question pairs spanning three core tasks (semantic reasoning, affective attribution, and contextual inference) and supports prompt-sensitivity analysis. Evaluating 12 state-of-the-art MLLMs reveals a maximum accuracy of only 74.8%, substantially below human performance (mean 90%, peak 98%), exposing systematic limitations in abstraction, fine-grained visual grounding, and internal affective modeling. Notably, augmenting inputs with sentiment-aware prompts improves performance across most models, providing further evidence that their internal affective representations are weak. This study is the first to formally define, operationalize, and quantitatively assess implicit image understanding in MLLMs, establishing a foundational benchmark and diagnostic toolkit for future research.

📝 Abstract
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate models' higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield significant findings. First, a substantial gap is observed between the performance of MLLMs and humans: the best MLLM accuracy reaches 74.8%, whereas human accuracy averages 90% and peaks at 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models exhibit improved accuracy when image sentiment polarity hints are incorporated into the prompts, underscoring a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
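The abstract reports that most models score higher when a sentiment polarity hint is added to the prompt. Below is a minimal sketch of how such a hint might be prepended to a multiple-choice question; the function name, prompt wording, and option labels are illustrative assumptions, not the paper's exact evaluation protocol.

```python
def build_prompt(question, options, polarity=None):
    """Compose a multiple-choice prompt, optionally prefixed with a
    sentiment-polarity hint (e.g. "positive" or "negative").

    Hypothetical helper: the hint phrasing here is an assumption,
    not the exact wording used by II-Bench.
    """
    lines = []
    if polarity is not None:
        # The sentiment hint goes first so the model sees it before the question.
        lines.append(f"Hint: the overall sentiment of the image is {polarity}.")
    lines.append(question)
    # Label options A, B, C, ... in order.
    for label, option in zip("ABCDEF", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)


prompt = build_prompt(
    "What is the implicit meaning of this image?",
    ["Irony about consumerism", "A literal depiction of shopping"],
    polarity="negative",
)
print(prompt)
```

Comparing model accuracy with and without the `polarity` argument is the kind of prompt-sensitivity analysis the benchmark supports.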
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Advanced Image Understanding
Limitations on Abstract and Complex Images
Innovation

Methods, ideas, or system contributions that make the work stand out.

II-Bench
Visual Understanding
Emotional Cues
Ziqiang Liu
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing, Large Language Model
Feiteng Fang
University of Science and Technology of China
LLM, NLP
Xi Feng
Shenzhen Institute of Advanced Technology, CAS; University of Science and Technology of China
Xinrun Du
Multimodal Art Projection Research Community, 01.ai
LLM
Chenhao Zhang
Shenzhen Institute of Advanced Technology, CAS; Huazhong University of Science and Technology
Zekun Wang
Shenzhen Institute of Advanced Technology, CAS; 01.ai
Yuelin Bai
Shenzhen Institute of Advanced Technology, CAS
Qixuan Zhao
Shenzhen Institute of Advanced Technology, CAS; University of Science and Technology of China
Liyang Fan
Shenzhen Institute of Advanced Technology, CAS
Chengguang Gan
Yokohama National University
Hongquan Lin
Shenzhen Institute of Advanced Technology, CAS; University of Science and Technology of China
Jiaming Li
Shenzhen Institute of Advanced Technology, CAS
Yuansheng Ni
University of Waterloo
Artificial Intelligence, Natural Language Processing, Large Language Models
Haihong Wu
Shenzhen Institute of Advanced Technology, CAS; University of Science and Technology of China
Yaswanth Narsupalli
B.Tech + M.Tech in AI and ML, IIT Kharagpur
Natural Language Processing, Large Language Models, Foundational Models
Zhigang Zheng
Shenzhen Institute of Advanced Technology, CAS
Chengming Li
Shenzhen MSU-BIT University
Xiping Hu
Professor in Beijing Institute of Technology
Cyber-Physical System, Crowd Computing, Affective Computing
Ruifeng Xu
Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing, Affective Computing, Argumentation Mining, LLMs, Bioinformatics
Xiaojun Chen
Shenzhen University
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding
Jiaheng Liu
Beihang University
Ruibo Liu
Research Scientist @ Google DeepMind
ASI
Wenhao Huang
01.ai
Ge Zhang
M-A-P; 01.ai; University of Waterloo
Shiwen Ni
Shenzhen Institute of Advanced Technology, CAS