Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

📅 2025-11-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current AI image evaluation relies predominantly on single scalar quality scores, which lack fine-grained diagnostic power along critical dimensions such as realism and plausibility and thus offer little targeted guidance for model optimization. To address this, we introduce Q-Real, the first fine-grained evaluation dataset explicitly designed for these two dimensions, comprising 3,088 images annotated with entity locations, realism/plausibility judgment questions, and attribute descriptions. On top of Q-Real, we build Q-Real Bench, a benchmark that evaluates MLLMs on two tasks: judgment, and grounding with reasoning. We further design an MLLM-adapted fine-tuning framework that combines human annotations with automated reasoning. Experiments demonstrate substantial improvements in MLLM performance across realism assessment, plausibility analysis, and entity localization, yielding interpretable, actionable feedback for refining generative models.
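As a concrete illustration of the two benchmark tasks, the sketch below shows the form that queries to an MLLM might take. The exact prompt templates are not given on this page, so the wording and structure here are assumptions, not the paper's actual prompts.

```python
# Hypothetical prompt formats for the two Q-Real Bench tasks.
# The templates actually used in the paper are not shown on this page.

JUDGMENT_PROMPT = (
    "Look at the {entity} in the image. "
    "Is it realistic and plausible? Answer yes or no, "
    "then briefly explain which attributes informed your judgment."
)

GROUNDING_PROMPT = (
    "Find every entity in the image that looks unrealistic or implausible. "
    "For each one, output its bounding box as (x_min, y_min, x_max, y_max) "
    "and explain, step by step, why it is flawed."
)

print(JUDGMENT_PROMPT.format(entity="hand"))
```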

📝 Abstract
Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribute descriptions for these entities along the dimensions of realism and plausibility. Since recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment, and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of our benchmark. The dataset and code will be released upon publication.
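To make the annotation scheme concrete, here is a minimal sketch of what a single Q-Real record might look like. The dataset has not been released yet, so every field name below (`entities`, `bbox`, `attribute_description`, and so on) is an assumption inferred from the abstract, not the actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema inferred from the abstract; the released
# dataset may use different field names and structures.

@dataclass
class EntityAnnotation:
    label: str                                # major entity in the image
    bbox: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    realism_judgment: bool                    # does the entity look realistic?
    plausibility_judgment: bool               # is the entity plausible?
    attribute_description: str                # free-text rationale for the judgment

@dataclass
class QRealRecord:
    image_path: str      # one of the 3,088 generated images
    generator: str       # text-to-image model that produced it
    entities: list[EntityAnnotation] = field(default_factory=list)

# Example record, for illustration only.
record = QRealRecord(
    image_path="images/00042.png",
    generator="sdxl",
    entities=[
        EntityAnnotation(
            label="hand",
            bbox=(0.31, 0.55, 0.48, 0.79),
            realism_judgment=True,
            plausibility_judgment=False,
            attribute_description="Six fingers are visible on the hand.",
        )
    ],
)
```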
Problem

Research questions and friction points this paper is trying to address.

Evaluating the realism and plausibility of AI-generated images
Providing fine-grained quality assessment that can guide generative-model improvement
Building a benchmark for evaluating multi-modal large language models (MLLMs)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-Real: a dataset for fine-grained realism and plausibility evaluation
Q-Real Bench: a benchmark that tests multimodal models on judgment and grounding-with-reasoning tasks (a scoring sketch follows below)
A fine-tuning framework that leverages the annotated entity locations to improve MLLMs
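The two benchmark tasks suggest straightforward metrics: accuracy for the yes/no judgment questions and intersection-over-union (IoU) for grounding. Below is a minimal, hypothetical scoring sketch; the paper's actual protocol is not detailed on this page, so the 0.5 IoU cutoff and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score_judgment(predictions, labels):
    """Accuracy over yes/no realism and plausibility questions."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def score_grounding(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predicted boxes matching ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```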