ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image captioning evaluation benchmarks suffer from limited description-length diversity, sparse coverage of state-of-the-art multimodal large language models (MLLMs), and insufficient human annotation, hindering comprehensive assessment of modern MLLMs' generative capabilities. To address these issues, this work introduces ICBench, a large-scale image captioning benchmark comprising 2,000 images across 12 content categories and 40,000 captions, both short and long, generated by 10 advanced MLLMs and accompanied by fine-grained human subjective ratings yielding mean opinion scores (MOS). The study further proposes ITIScore, an automatic evaluation metric based on image-to-text-to-image reconstruction consistency, which the authors describe as the first application of this mechanism to caption quality assessment. Experiments show that ITIScore correlates strongly with human judgments and generalizes well in zero-shot settings across multiple public datasets. Both ICBench and ITIScore will be publicly released.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization on other public captioning datasets. Both the dataset and model will be released upon publication.
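For concreteness, MOS aggregation reduces to averaging raters' scores per caption and per evaluation dimension. The sketch below is a minimal illustration, not the paper's code; the 1-to-5 rating scale, the example caption IDs, and the rater counts are all assumptions.

```python
# Minimal MOS aggregation sketch (assumed 1-5 scale; toy data, not from the paper).
from collections import defaultdict
from statistics import mean

# Each record: (caption_id, dimension, one rater's score).
ratings = [
    ("cap_01", "fluency", 5), ("cap_01", "fluency", 4),
    ("cap_01", "relevance", 4), ("cap_01", "relevance", 5),
    ("cap_01", "conciseness", 3), ("cap_01", "conciseness", 4),
]

# Group all raters' scores by (caption, dimension).
buckets = defaultdict(list)
for cap_id, dim, score in ratings:
    buckets[(cap_id, dim)].append(score)

# MOS = arithmetic mean over raters for each caption-dimension pair.
mos = {key: mean(scores) for key, scores in buckets.items()}
print(mos)  # {('cap_01', 'fluency'): 4.5, ('cap_01', 'relevance'): 4.5, ...}
```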
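The reconstruction-consistency idea behind ITIScore can be sketched as: regenerate an image from the candidate caption with a text-to-image model, then compare the regeneration to the original image in an embedding space. The snippet below is a hedged illustration only; the paper does not specify its backbones or scoring function here, so Stable Diffusion, CLIP image embeddings, and cosine similarity are assumptions, not the authors' actual pipeline.

```python
# Sketch of an image-to-text-to-image consistency score.
# ASSUMPTIONS (not from the paper): Stable Diffusion for reconstruction,
# CLIP embeddings plus cosine similarity for comparison.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image model used to reconstruct an image from the caption.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # placeholder backbone choice
).to(device)

# CLIP vision encoder used to embed the original and reconstructed images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def iti_consistency(original: Image.Image, caption: str) -> float:
    """Reconstruct an image from `caption`, then return the cosine
    similarity between CLIP embeddings of the original image and the
    reconstruction (higher = caption preserves more image content)."""
    reconstructed = t2i(caption).images[0]
    inputs = proc(images=[original, reconstructed], return_tensors="pt").to(device)
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[0] @ emb[1]).item()
```

A more faithful reimplementation would need the paper's choices of text-to-image model, image encoder, and any calibration applied before correlating with MOS.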
Problem

Research questions and friction points this paper is trying to address.

image captioning
multimodal large language models
evaluation benchmark
human annotation
caption diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

ITIScore
image captioning
multimodal large language models
reconstruction consistency
human evaluation