PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image caption evaluation metrics (e.g., CIDEr, SPICE) were designed for short captions and lack the granularity to detect fine-grained errors, particularly in attribute assignment and relational structure, within long, detailed descriptions. To address this, the paper proposes PoSh, a fine-grained evaluation method grounded in scene graph representations: it uses structured, interpretable, and alignable scene graphs as rubrics to guide large language models (LLMs) in span-level error identification. PoSh combines vision-language understanding, scene graph parsing, and an LLM-as-a-judge framework, and the paper introduces DOCENT, a manually annotated benchmark for fine-grained evaluation of detailed image descriptions. Experiments show that PoSh improves Spearman correlation by +0.05 over the best open-weight baselines on DOCENT, outperforms GPT-4o-as-a-Judge, is robust across diverse image types, and serves as an effective reward signal, beating standard supervised fine-tuning. Moreover, PoSh exposes critical limitations of current multimodal foundation models in generating accurate, compositional scene descriptions.

📝 Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT-4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating detailed image descriptions using scene graphs
Addressing limitations of standard metrics for long texts
Improving interpretability and human correlation in VLM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses scene graphs as structured rubrics
Guides LLMs-as-a-Judge for evaluation
Produces aggregate scores from fine-grained errors
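The pipeline the bullets describe can be sketched in a few lines: flatten a reference scene graph into per-fact rubric items (objects, attributes, relations), pose each item to a judge LLM against the candidate description, and aggregate the verdicts into one score. This is a minimal illustrative sketch, not the paper's exact formulation; the graph schema, question templates, and uniform aggregation below are assumptions, and the LLM judge is stubbed out with hand-set verdicts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricItem:
    kind: str   # "object", "attribute", or "relation"
    text: str   # natural-language check posed to the judge LLM

def scene_graph_to_rubric(graph):
    """Flatten a reference scene graph into per-fact rubric items.

    `graph` is an assumed schema: {"objects": {name: [attributes]},
    "relations": [(subject, predicate, object)]}.
    """
    items = []
    for obj, attrs in graph.get("objects", {}).items():
        items.append(RubricItem("object", f"Does the description mention a {obj}?"))
        for attr in attrs:
            items.append(RubricItem("attribute", f"Is the {obj} described as {attr}?"))
    for subj, rel, obj in graph.get("relations", []):
        items.append(RubricItem("relation", f"Is the {subj} {rel} the {obj}?"))
    return items

def aggregate_score(verdicts):
    """Aggregate per-item True/False judge verdicts into a score in [0, 1]."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

graph = {
    "objects": {"dog": ["brown"], "ball": []},
    "relations": [("dog", "chasing", "ball")],
}
rubric = scene_graph_to_rubric(graph)
# In the real pipeline, each rubric item would be sent to the judge LLM
# alongside the candidate description; here we stub the verdicts.
verdicts = [True, True, False, True]  # e.g. the description omitted "brown"
score = aggregate_score(verdicts)     # fraction of rubric items satisfied
```

Because each rubric item targets one object, attribute, or relation, a failed item localizes the error to a specific fact, which is what makes the aggregate score interpretable rather than a single opaque number.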