CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two fundamental questions: (1) whether modern vision-language models (VLMs) achieve human-level performance in detailed image captioning, and (2) whether existing automated evaluation metrics reliably align with human preferences. To this end, we introduce CapArena—the first arena-style human preference benchmark for captioning—built on over 6,000 pairwise caption battles with high-quality human preference votes. CapArena enables systematic evaluation of state-of-the-art VLMs and quantifies metric–human alignment. We further show that the VLM-as-a-Judge paradigm demonstrates robust discernment at both the caption and model levels, substantially outperforming conventional metrics (e.g., METEOR, SPICE). Leveraging these insights, we release CapArena-Auto, a lightweight automated benchmark achieving 94.3% correlation with human rankings at just $4 per test. All data, annotations, and evaluation tools are publicly open-sourced.

📝 Abstract
Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
Problem

Research questions and friction points this paper is trying to address.

Measuring how current VLMs' detailed image captioning compares with human performance
Assessing whether automated metrics reliably reflect human judgments of caption quality
Building an accurate yet low-cost automated benchmark for detailed captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CapArena, an arena-style platform with 6,000+ pairwise caption battles and human preference votes
VLM-as-a-Judge for robust caption- and model-level assessment
CapArena-Auto, an automated benchmark achieving 94.3% correlation with human rankings at $4 per test
👥 Authors
Kanzhi Cheng
Ph.D. Student, Nanjing University
Vision-Language Models, AI Agents, Image Captioning

Wenpo Song
National Key Laboratory for Novel Software Technology, Nanjing University

Jiaxin Fan
National Key Laboratory for Novel Software Technology, Nanjing University

Zheng Ma
National Key Laboratory for Novel Software Technology, Nanjing University

Qiushi Sun
The University of Hong Kong, National University of Singapore
Natural Language Processing, Agents, Code Intelligence

Fangzhi Xu
Xi'an Jiaotong University | Nanyang Technological University
Large Language Models, Self-Training, Reasoning, GUI Agents

Chenyang Yan
National Key Laboratory for Novel Software Technology, Nanjing University

Nuo Chen
National Key Laboratory for Novel Software Technology, Nanjing University

Jianbing Zhang
Associate Professor, Nanjing University
Pre-training Models, Multi-modal, Image Captioning, Natural Language Processing, Data Mining

Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University