REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) generation lacks fine-grained, interpretable, and human-preference-aligned evaluation methods. Method: We propose a unified evaluation framework grounded in reinforcement-guided visual reasoning, introducing a "grounding-reasoning-conclusion" paradigm that enables interpretable, element-level alignment quantification. Our approach integrates multimodal large language models (MLLMs) with visual grounding and optimizes them via Group Relative Policy Optimization (GRPO) under a structured reward that jointly scores format compliance, localization accuracy, and alignment fidelity. Results: Our framework achieves state-of-the-art performance across four major benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench), significantly outperforming both proprietary models and supervised baselines. Moreover, it attains higher inference efficiency than iterative visual reasoning methods, enabling scalable, principled T2I evaluation.
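As a rough illustration of the structured reward described above, the sketch below combines the three terms (format compliance, localization accuracy, alignment fidelity) into a single scalar. The weights, the IoU-based grounding term, and the binary alignment term are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the composite reward: a weighted sum of format
# compliance, localization (grounding) accuracy, and alignment fidelity.
# Weights and term definitions are assumptions, not taken from the paper.

def composite_reward(
    format_ok: bool,
    grounding_iou: float,      # IoU between predicted and reference boxes, in [0, 1]
    alignment_correct: bool,   # final judgment matches the human label
    w_format: float = 0.2,
    w_ground: float = 0.3,
    w_align: float = 0.5,
) -> float:
    """Combine the three reward terms into a single scalar."""
    r_format = 1.0 if format_ok else 0.0
    r_ground = max(0.0, min(1.0, grounding_iou))
    r_align = 1.0 if alignment_correct else 0.0
    return w_format * r_format + w_ground * r_ground + w_align * r_align


# Example: well-formatted output, decent localization, correct judgment.
print(composite_reward(True, grounding_iou=0.7, alignment_correct=True))  # 0.91
```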

📝 Abstract
Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
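The abstract mentions a structural-format term in the reward. A minimal sketch of such a format check is given below; the tag names (`<grounding>`, `<reasoning>`, `<conclusion>`) are hypothetical placeholders for whatever markup the paper actually prescribes.

```python
import re

# Hypothetical format check for a "grounding-reasoning-conclusion" output.
# The tag names below are assumptions; the paper specifies a structured
# format, but its exact markup is not given on this page.
STRUCTURE = re.compile(
    r"<grounding>(.+?)</grounding>\s*"
    r"<reasoning>(.+?)</reasoning>\s*"
    r"<conclusion>(.+?)</conclusion>",
    re.DOTALL,
)

def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the three-stage structure, else 0.0."""
    return 1.0 if STRUCTURE.search(response) else 0.0

print(format_reward(
    "<grounding>box: [0.1, 0.2, 0.4, 0.5] (the red cat)</grounding>"
    "<reasoning>The prompt asks for a red cat; the localized region shows one.</reasoning>"
    "<conclusion>aligned</conclusion>"
))  # 1.0
```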
Problem

Research questions and friction points this paper is trying to address.

Evaluates element-level text-image alignment for T2I models
Addresses coarse-grained metrics lacking fine-grained interpretability
Proposes reinforcement-guided visual reasoning for interpretable judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement-guided visual reasoning for fine-grained alignment
Group Relative Policy Optimization with composite reward function (see the sketch after this list)
Structured grounding-reasoning-conclusion paradigm using MLLMs
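GRPO itself is an existing RL algorithm (introduced by Shao et al., 2024, for DeepSeekMath) that the paper adopts rather than invents. Its core step, normalizing each sampled response's reward against the group's mean and standard deviation instead of a learned critic, can be sketched as follows; the group size and rewards here are illustrative only.

```python
# Minimal sketch of GRPO's group-relative advantage. For each prompt, a group
# of G responses is sampled and each reward is normalized against the group's
# mean and standard deviation; no learned value function is needed.

from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each response's reward against its sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: composite rewards for four sampled evaluations of the same
# text-image pair; higher-than-average rewards get positive advantages.
print(group_relative_advantages([0.91, 0.20, 0.50, 0.91]))
```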
👥 Authors
Fulin Shi (Zhejiang University)
Wenyi Xiao (Zhejiang University)
Bin Chen (Alibaba Group)
Liang Ding (Alibaba Group)
Leilei Gan (Zhejiang University)
Tags: NLP · LLMs · Multimodal LLMs · AI+X