Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation metrics for fine-grained image captioning suffer from obsolescence and coarse-grained human annotations, limiting reliable assessment of descriptive fidelity. Method: We introduce DeCapBench, a benchmark dedicated to fine-grained caption evaluation, and DCScore, a novel metric that decomposes captions into primitive information units to independently quantify hallucination rate and content comprehensiveness. We further propose FeedQuill, an automated method for high-quality preference data collection, and leverage DCScore for alignment learning and preference optimization. Contribution/Results: Experiments show DeCapBench exhibits strong correlation (ρ > 0.92) with VLM Arena rankings on descriptive tasks, outperforming prior vision-language benchmarks. Our approach substantially reduces hallucination rates and achieves state-of-the-art performance across multiple fine-grained captioning benchmarks, consistently surpassing GPT-4o.

📝 Abstract
Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
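The abstract's scoring idea can be sketched in code: a caption is deconstructed into primitive information units, each unit is verified against the image, and hallucination and comprehensiveness are scored separately. This is a minimal illustrative sketch, not the authors' implementation: in the paper the decomposition and verification are performed by models, whereas here both are stubbed with exact set membership so the scoring arithmetic is runnable; the function name `dcscore_sketch` is hypothetical.

```python
# Hedged sketch of DCScore-style scoring over primitive information units.
# Unit extraction and per-unit verification are model-based in the paper;
# here they are replaced by exact set membership for illustration.

def dcscore_sketch(candidate_units, reference_units):
    """Score a caption decomposed into primitive information units.

    candidate_units: units extracted from the generated caption.
    reference_units: units an oracle deems true of the image.
    Returns (precision, recall, f1): precision penalizes hallucinated
    units; recall rewards fine-grained comprehensiveness.
    """
    cand, ref = set(candidate_units), set(reference_units)
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    supported = cand & ref                  # units verified against the image
    precision = len(supported) / len(cand)  # 1 - hallucination rate
    recall = len(supported) / len(ref)      # coverage of image content
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: 4 extracted units, 3 supported ("hat" is hallucinated),
# while the image contains 5 reference units in total.
p, r, f = dcscore_sketch(
    ["a dog", "dog is brown", "dog on a sofa", "dog wears a hat"],
    ["a dog", "dog is brown", "dog on a sofa", "sofa is red", "a lamp"],
)
print(round(p, 2), round(r, 2))  # → 0.75 0.6
```

Keeping precision and recall separate is what lets the metric report hallucination and comprehensiveness independently rather than collapsing both into a single overlap score.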
Problem

Research questions and friction points this paper is trying to address.

Existing rule-based and model-based captioning metrics are outdated and ill-suited to detailed captions.
Human annotations are coarse, limiting reliable assessment of hallucination and descriptive completeness.
Preference data for aligning VLMs toward detailed, faithful captions is costly to collect manually.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces DeCapBench for detailed captioning evaluation
Develops DCScore metric for fine-grained caption assessment
Presents FeedQuill for automatic feedback and preference optimization