Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Paragraph-level image descriptions generated by large vision-language models (VLMs) currently lack fine-grained, interpretable, and factuality-aware evaluation methods. Method: We introduce DOCCI-Critique, the first benchmark with sentence-level error-type classification and precise error localization, and propose VNLI-Critique, an automated evaluation model combining vision-language natural language inference (VNLI), multi-stage supervised fine-tuning, and LLM-guided attributional explanation generation. We further design Critic-and-Revise, a closed-loop framework for end-to-end error detection, explanation, and rewriting. Results: VNLI-Critique achieves state-of-the-art performance on M-HalDetect; the VNLI-Critique-driven AutoRater attains a 0.98 Spearman correlation with human judgments; and Critic-and-Revise improves factual accuracy by 46% on DetailCaps-4870. This work sets a new standard for granular, explainable, and actionable factuality assessment of VLM-generated image descriptions.

📝 Abstract
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption
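The AutoRater's agreement with human judgments is reported as Spearman rank correlation over per-model factuality scores. As a minimal sketch of how such a ranking comparison is computed (the scores below are illustrative, not taken from the paper), Spearman's rho can be obtained from the rank-difference formula when there are no ties:

```python
# Pure-Python Spearman rank correlation (assumes no tied scores).
# Compares two score lists by converting each to ranks, then applying
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).

def spearman(xs, ys):
    def ranks(vals):
        # Rank 1 = smallest value; works because no ties are assumed.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-model factuality scores for five VLMs:
autorater = [0.91, 0.85, 0.78, 0.70, 0.66]
human     = [0.88, 0.75, 0.86, 0.72, 0.60]  # humans swap models 2 and 3
print(spearman(autorater, human))  # → 0.9
```

With identical rankings the function returns 1.0; here one pairwise swap between the two rankings lowers rho to 0.9.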
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual accuracy of detailed VLM-generated image captions
Lack of fine-grained error detection in current caption evaluation methods
Need for automated tools to improve caption factuality and VLM understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

DOCCI-Critique benchmark with human annotations
VNLI-Critique model for factuality classification
Critic-and-Revise pipeline for caption improvement
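The Critic-and-Revise pipeline above can be sketched as a closed loop: a sentence-level critic flags inaccurate sentences with a rationale, and a reviser rewrites only the flagged sentences. The code below is an illustrative sketch, not the paper's implementation; `critique` and `revise` are deterministic stubs standing in for VNLI-Critique and an LLM rewriter, and the example caption and "two dogs" ground truth are invented for demonstration:

```python
# Hypothetical sketch of a Critic-and-Revise loop.

def critique(image, sentence):
    """Stub critic: a real system would run sentence-level VNLI against
    the image and generate a rationale for each inaccurate sentence."""
    if "three dogs" in sentence:  # pretend the image shows two dogs
        return "inaccurate", "The image shows two dogs, not three."
    return "accurate", ""

def revise(sentence, rationale):
    """Stub reviser: a real system would prompt an LLM with the
    sentence plus the critic's rationale and return the rewrite."""
    return sentence.replace("three dogs", "two dogs")

def critic_and_revise(image, caption):
    """Check each sentence; rewrite only those flagged as inaccurate,
    leaving accurate sentences untouched."""
    revised = []
    for sentence in caption.split(". "):
        label, rationale = critique(image, sentence)
        if label == "inaccurate":
            sentence = revise(sentence, rationale)
        revised.append(sentence)
    return ". ".join(revised)

caption = "A park scene. There are three dogs on the grass"
print(critic_and_revise(None, caption))
# → A park scene. There are two dogs on the grass
```

Operating per sentence is what makes the loop fine-grained: correct sentences pass through unchanged, so revisions stay local to the detected errors.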