ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-image (T2I) evaluation methods rely on single scalar scores, lacking interpretability and fine-grained diagnostic capability. To address this, the paper proposes ImageDoctor, a unified multi-aspect evaluation framework that scores generated images along four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality, while localizing flaws via pixel-level heatmaps. Its core idea is a "look-think-predict" paradigm: the model first localizes potential flaws, then generates reasoning, and finally concludes with quantitative scores, enabling interpretable, fine-grained diagnosis. Built on a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor shows strong agreement with human preferences across multiple datasets; when used as a dense reward model for preference tuning, it improves generation quality by 10% over scalar-based reward models.

📝 Abstract
The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving an improvement of 10% over scalar-based reward models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-aspect image quality beyond single scalar metrics
Providing pixel-level flaw indicators for interpretable feedback
Enhancing detail sensitivity via look-think-predict reasoning paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-aspect evaluation framework for T2I models
Pixel-level heatmaps to indicate image flaws
Look-think-predict paradigm for detailed reasoning
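To make the multi-aspect idea concrete, here is a minimal illustrative sketch of what a four-dimensional score record might look like and why collapsing it to one scalar loses diagnostic detail. All names (`MultiAspectScore`, `rank_candidates`) and the numbers are hypothetical; the paper's actual interface is not shown in this summary.

```python
# Hypothetical sketch: four-aspect scores as described in the abstract.
from dataclasses import dataclass
from typing import List


@dataclass
class MultiAspectScore:
    """Four complementary quality dimensions named in the paper."""
    plausibility: float        # physical/anatomical soundness
    semantic_alignment: float  # faithfulness to the text prompt
    aesthetics: float          # visual appeal
    overall: float             # holistic quality

    def as_scalar(self) -> float:
        # Averaging to a single scalar (as scalar-based reward models do)
        # discards the per-aspect detail the paper argues is needed.
        return (self.plausibility + self.semantic_alignment
                + self.aesthetics + self.overall) / 4.0


def rank_candidates(scores: List[MultiAspectScore]) -> List[int]:
    """Order candidate images best-first by averaged score (toy example)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i].as_scalar())


# Toy usage with made-up scores for two candidate images.
a = MultiAspectScore(0.9, 0.8, 0.7, 0.85)
b = MultiAspectScore(0.6, 0.9, 0.8, 0.75)
print(rank_candidates([a, b]))  # a averages higher, so index 0 ranks first
```

Note that `b` beats `a` on semantic alignment and aesthetics yet loses the scalar ranking, which is exactly the kind of information a per-aspect report (plus pixel-level heatmaps) preserves.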
Yuxiang Guo
Johns Hopkins University
Computer Vision
Jiang Liu
AMD
Ze Wang
AMD
Hao Chen
AMD
Ximeng Sun
AMD
Yang Zhao
Johns Hopkins University
Jialian Wu
AMD GenAI
LLM, Computer Vision
Xiaodong Yu
AMD
Zicheng Liu
AMD
Emad Barsoum
AMD, Columbia University
Generative AI, Foundation Models, Agentic AI, Computer Vision, ML Frameworks