🤖 AI Summary
This work addresses a critical deficiency of multimodal large language models (MLLMs): their weak comprehension of high-level implicit image semantics, such as underlying intent, abstract concepts, and sentiment polarity. It introduces II-Bench, the first dedicated benchmark for this capability. II-Bench comprises human-crafted, multidimensional image–question pairs spanning three core tasks (semantic reasoning, affective attribution, and contextual inference) and supports prompt-sensitivity analysis. An evaluation of 12 state-of-the-art MLLMs shows a maximum accuracy of only 74.8%, substantially below human performance (mean 90%, peak 98%), exposing systematic limitations in abstraction, fine-grained visual grounding, and sentiment understanding. Notably, adding sentiment polarity hints to the prompt improves accuracy for most models, which points to a weak built-in understanding of image sentiment. The study is the first to formally define, operationalize, and quantitatively assess implicit image understanding in MLLMs, establishing a foundational benchmark and diagnostic toolkit for future research.
📝 Abstract
Rapid progress in multimodal large language models (MLLMs) has led to repeated breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess MLLM capabilities more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which evaluates a model's higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between MLLMs and humans on II-Bench: the best MLLM accuracy is 74.8%, whereas human accuracy averages 90% and peaks at 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models achieve higher accuracy when image sentiment polarity hints are incorporated into the prompts, which indicates a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
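The sentiment-hint comparison described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Item` fields, the prompt wording, and the `predict` callable (a stand-in for an MLLM call, with image passing omitted) are hypothetical and are not taken from the II-Bench release.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Item:
    """Hypothetical multiple-choice item; field names are assumptions."""
    question: str
    options: Sequence[str]
    answer: str                      # gold option letter, e.g. "B"
    sentiment: Optional[str] = None  # e.g. "positive", "neutral", "negative"


def build_prompt(item: Item, use_sentiment_hint: bool = False) -> str:
    """Format a multiple-choice prompt, optionally prefixed with a sentiment hint."""
    lines = []
    if use_sentiment_hint and item.sentiment:
        lines.append(f"Hint: the sentiment polarity of this image is {item.sentiment}.")
    lines.append(f"Question: {item.question}")
    for letter, option in zip("ABCDEFGH", item.options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy(
    items: Sequence[Item],
    predict: Callable[[str], str],
    use_sentiment_hint: bool = False,
) -> float:
    """Fraction of items where the predicted letter matches the gold letter."""
    if not items:
        return 0.0
    correct = sum(
        predict(build_prompt(item, use_sentiment_hint)).strip().upper()
        == item.answer.upper()
        for item in items
    )
    return correct / len(items)
```

Calling `accuracy` twice on the same items and the same `predict`, once with `use_sentiment_hint=False` and once with `True`, gives the two numbers whose difference the abstract refers to.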