🤖 AI Summary
Fine-grained, interpretable, and factuality-aware evaluation methods are lacking for the paragraph-level image descriptions now generated by large vision-language models (VLMs).
Method: We introduce DOCCI-Critique—the first benchmark with sentence-level error-type classification and precise error localization—and propose VNLI-Critique, an automated evaluation model integrating vision-language natural language inference (VNLI), multi-stage supervised fine-tuning, and LLM-guided attributional explanation generation. We further design Critic-and-Revise, a closed-loop framework enabling end-to-end detection, explanation, and rewriting.
Contribution/Results: VNLI-Critique achieves state-of-the-art performance on M-HalDetect; AutoRater attains a Spearman correlation of 0.98 with human judgments; Critic-and-Revise improves factual accuracy by 46% on DetailCaps-4870. This work establishes a new standard for granular, explainable, and actionable factuality assessment of VLM-generated image descriptions.
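The Critic-and-Revise loop described above (detect sentence-level errors, explain them, then rewrite) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `critic` and `reviser` callables stand in for VNLI-Critique and the LLM rewriter, and the sentence splitting is deliberately naive.

```python
def critic_and_revise(caption, critic, reviser):
    """Sketch of a Critic-and-Revise pass over a paragraph caption.

    critic(sentence, caption)  -> (verdict, rationale); a hypothetical stand-in
                                  for a VNLI-style factuality classifier.
    reviser(sentence, rationale) -> corrected sentence; a hypothetical stand-in
                                    for an LLM-based rewriter.
    """
    # Naive sentence split on periods; a real system would use a proper
    # sentence segmenter and keep the image in the loop.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    revised = []
    for sentence in sentences:
        verdict, rationale = critic(sentence, caption)
        if verdict == "contradiction":
            # Only flagged sentences are rewritten; correct ones pass through.
            sentence = reviser(sentence, rationale)
        revised.append(sentence)
    return ". ".join(revised) + "."
```

A usage example with toy stand-ins: a critic that flags any sentence containing "wrong" and a reviser that substitutes a fixed correction.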
📝 Abstract
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors because they were designed for shorter texts or lack datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption
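The reported AutoRater alignment (0.98 Spearman) is a rank correlation between model scores and human judgments. A minimal, dependency-free sketch of that statistic is below; it assumes no tied ranks (production implementations such as `scipy.stats.spearmanr` average ranks across ties).

```python
def _ranks(values):
    """Assign 1-based ranks to values; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    n = len(x)
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, `spearman(autorater_scores, human_scores)` near 1.0 indicates the two rankings of VLMs nearly coincide, which is how the 0.98 figure should be read.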