🤖 AI Summary
Fine-grained, interpretable, and factuality-aware evaluation methods are lacking for the paragraph-level image descriptions now generated by large vision-language models (VLMs).
Method: We introduce DOCCI-Critique—the first benchmark with sentence-level error-type classification and precise error localization—and propose VNLI-Critique, an automated evaluation model integrating vision-language natural language inference (VNLI), multi-stage supervised fine-tuning, and LLM-guided attributional explanation generation. We further design Critic-and-Revise, a closed-loop framework enabling end-to-end detection, explanation, and rewriting.
Contribution/Results: VNLI-Critique achieves state-of-the-art performance on M-HalDetect; AutoRater attains a Spearman correlation of 0.98 with human judgments; Critic-and-Revise improves factual accuracy by 46% on DetailCaps-4870. This work establishes a new standard for granular, explainable, and actionable factuality assessment of VLM-generated image descriptions.
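The Critic-and-Revise loop described above (detect sentence-level errors, explain them, then rewrite) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `critic` and `reviser` callables stand in for VNLI-Critique and the LLM rewriter, and the sentence splitting is deliberately naive.

```python
def critic_and_revise(caption, critic, reviser):
    """Sketch of a Critic-and-Revise pass over a paragraph caption.

    critic(sentence, caption)  -> (verdict, rationale); a hypothetical stand-in
                                  for a VNLI-style factuality classifier.
    reviser(sentence, rationale) -> corrected sentence; a hypothetical stand-in
                                    for an LLM-based rewriter.
    """
    # Naive sentence split on periods; a real system would use a proper
    # sentence segmenter and keep the image in the loop.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    revised = []
    for sentence in sentences:
        verdict, rationale = critic(sentence, caption)
        if verdict == "contradiction":
            # Only flagged sentences are rewritten; correct ones pass through.
            sentence = reviser(sentence, rationale)
        revised.append(sentence)
    return ". ".join(revised) + "."
```

A usage example with toy stand-ins: a critic that flags any sentence containing "wrong" and a reviser that substitutes a fixed correction.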
📝 Abstract
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors because they were designed for shorter texts or lack datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption
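The reported AutoRater alignment (0.98 Spearman) is a rank correlation between model scores and human judgments. A minimal, dependency-free sketch of that statistic is below; it assumes no tied ranks (production implementations such as `scipy.stats.spearmanr` average ranks across ties).

```python
def _ranks(values):
    """Assign 1-based ranks to values; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    n = len(x)
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, `spearman(autorater_scores, human_scores)` near 1.0 indicates the two rankings of VLMs nearly coincide, which is how the 0.98 figure should be read.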