🤖 AI Summary
This work addresses two fundamental questions: (1) whether modern vision-language models (VLMs) achieve human-level performance in detailed image captioning, and (2) whether existing automated evaluation metrics reliably align with human preferences. To answer them, the authors introduce CapArena, an arena-style human preference benchmark for captioning built on more than 6,000 pairwise caption battles with high-quality human preference votes. This evaluation shows that leading models such as GPT-4o match or even surpass human performance, while most open-source models lag behind. Using CapArena's human annotations, the authors then assess traditional metrics, recent captioning metrics, and VLM-as-a-Judge. Some metrics (e.g., METEOR) show decent caption-level agreement with humans, but their systematic biases distort model rankings; VLM-as-a-Judge, in contrast, is robust at both the caption and model levels. Building on these findings, they release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning that achieves 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced.
📝 Abstract
Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6,000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
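To make the evaluation pipeline concrete, pairwise battles can be aggregated into a model ranking and then compared against a human ranking via a rank correlation. The sketch below illustrates the idea with simple per-model win rates and Spearman's rho; it is a minimal illustration only, the model names and battle data are hypothetical, and the actual aggregation and correlation measure used by CapArena may differ.

```python
# Hypothetical pairwise battle outcomes, standing in for
# CapArena-style human preference votes: (winner, loser) pairs.
battles = [
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_a", "model_b"),
    ("model_c", "model_b"),
]

def win_rates(battles):
    """Fraction of battles each model wins (a simple aggregation)."""
    wins, games = {}, {}
    for winner, loser in battles:
        for m in (winner, loser):
            games[m] = games.get(m, 0) + 1
        wins[winner] = wins.get(winner, 0) + 1
    return {m: wins.get(m, 0) / games[m] for m in games}

def spearman(rank_a, rank_b):
    """Spearman rank correlation between two rankings
    (lists of model names, best first), assuming no ties."""
    n = len(rank_a)
    pos_b = {m: i for i, m in enumerate(rank_b)}
    d2 = sum((i - pos_b[m]) ** 2 for i, m in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rates = win_rates(battles)
auto_ranking = sorted(rates, key=rates.get, reverse=True)
human_ranking = ["model_a", "model_c", "model_b"]  # hypothetical

agreement = spearman(auto_ranking, human_ranking)
```

With this toy data the win-rate ranking recovers the human ordering exactly (rho = 1.0); disagreements between the two rankings would push rho below 1.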