🤖 AI Summary
Problem: Text-to-image (T2I) generation lacks fine-grained, interpretable, and human-preference-aligned evaluation methods.
Method: We propose REVEALER, a unified evaluation framework grounded in reinforcement-guided visual reasoning. It introduces a structured "grounding-reasoning-conclusion" (localize, reason, judge) paradigm for interpretable, element-level alignment quantification: a multimodal large language model (MLLM) explicitly localizes semantic elements via visual grounding and derives alignment judgments from them. The model is trained with Group Relative Policy Optimization (GRPO) under a composite reward that jointly scores format compliance, localization accuracy, and alignment fidelity (see the reward sketch after this summary).
Results: Our framework achieves state-of-the-art performance across four major benchmarks—EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench—significantly outperforming both proprietary models and supervised baselines. Moreover, it attains higher inference efficiency than iterative visual reasoning methods, enabling scalable, principled T2I evaluation.
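As a rough illustration of the composite reward described in the method summary, the sketch below combines a format-compliance check, a grounding (IoU) term, and an alignment-verdict term. The tag names, weights, and helper functions are illustrative assumptions, not REVEALER's actual implementation.

```python
# Minimal sketch of a composite reward of the kind described above.
# Weights, response-template tags, and data structures are assumptions.
import re
from dataclasses import dataclass


@dataclass
class Judgment:
    boxes: list[tuple[float, float, float, float]]  # predicted element boxes (x1, y1, x2, y2)
    aligned: bool                                    # predicted element-level alignment verdict
    raw_text: str                                    # full model response


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def composite_reward(pred: Judgment, gt_boxes, gt_aligned: bool,
                     w_format=0.2, w_ground=0.4, w_align=0.4) -> float:
    # 1) Format compliance: response follows a grounding-reasoning-conclusion template
    #    (the tag names below are hypothetical).
    has_structure = bool(re.search(
        r"<ground>.*</ground>.*<think>.*</think>.*<answer>.*</answer>",
        pred.raw_text, flags=re.S))
    r_format = 1.0 if has_structure else 0.0

    # 2) Grounding accuracy: best-match IoU of each predicted box against ground-truth boxes.
    if pred.boxes and gt_boxes:
        r_ground = sum(max(iou(p, g) for g in gt_boxes) for p in pred.boxes) / len(pred.boxes)
    else:
        r_ground = 0.0

    # 3) Alignment fidelity: does the final verdict match the human label?
    r_align = 1.0 if pred.aligned == gt_aligned else 0.0

    return w_format * r_format + w_ground * r_ground + w_align * r_align
```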
📝 Abstract
Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables multimodal large language models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
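To make the GRPO step concrete, the minimal sketch below shows the group-relative advantage computation that GRPO is built around: several responses are sampled per prompt-image pair, each is scored (e.g., with a composite reward like the one sketched above), and rewards are normalized against the group's mean and standard deviation. The group size, reward values, and function names here are illustrative assumptions, not the paper's training code.

```python
# Group-relative advantage normalization as used in GRPO (illustrative sketch).
import torch


def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: (num_prompts, group_size) rewards for sampled responses."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    # Responses better than the group average receive positive advantages.
    return (group_rewards - mean) / (std + eps)


# Example: 8 sampled evaluations for one prompt-image pair.
rewards = torch.tensor([[0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3]])
advantages = grpo_advantages(rewards)
```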