IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-driven image editing evaluation has long suffered from poor alignment with human perception and a lack of high-quality, diverse benchmarks. To address this, we introduce IE-Bench—the first comprehensive evaluation benchmark featuring both broad task diversity and fine-grained human annotations (nearly 4,000 samples). We further propose IE-Critic-R1, the first evaluation model that explicitly incorporates interpretable human perceptual priors into its design. Built upon reinforcement learning from verifiable rewards (RLVR), IE-Critic-R1 jointly optimizes three key dimensions: text–image semantic alignment, edit region fidelity, and perceptual plausibility. Extensive experiments demonstrate that IE-Critic-R1 significantly outperforms existing metrics (e.g., CLIPScore, DINO-score) in subjective consistency with human judgment. IE-Bench is publicly released to foster reproducible, standardized evaluation in the field.

📝 Abstract
Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike the assessment of text-driven image generation, text-driven image editing is conditioned simultaneously on both a text prompt and a source image. The edited images often retain an intrinsic connection to the original image, one that changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or do not align well with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, totaling nearly 4,000 samples with Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and code are publicly available.
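The abstract's claim of "superior subjective alignment" is conventionally quantified by correlating an automated metric's scores with the human MOS, typically via the Spearman rank correlation coefficient (SRCC). The sketch below illustrates this standard evaluation protocol with made-up scores; it is not the paper's implementation, and the toy values are hypothetical.

```python
# Illustration (not the paper's code): measuring how well an automated
# image-editing metric agrees with human Mean Opinion Scores (MOS)
# using the Spearman rank correlation coefficient (SRCC).

def ranks(values):
    """Assign average 1-based ranks to a list of scores (ties averaged)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def srcc(metric_scores, mos):
    """SRCC = Pearson correlation computed on the ranks of both lists."""
    return pearson(ranks(metric_scores), ranks(mos))

# Toy example: 5 edited images scored by a metric vs. human MOS (1-5 scale).
metric = [0.31, 0.74, 0.52, 0.90, 0.12]  # hypothetical metric outputs
mos = [2.1, 3.8, 3.0, 4.6, 1.5]          # hypothetical human ratings
print(f"SRCC = {srcc(metric, mos):.3f}")  # ranks agree perfectly here
```

A higher SRCC means the metric orders edited images the same way human raters do, which is the sense in which IE-Critic-R1 is compared against metrics such as CLIPScore and DINO-score.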
Problem

Research questions and friction points this paper is trying to address.

Evaluating text-driven image editing quality accurately
Aligning automated assessment with human perception standards
Developing explainable metrics for text-image editing evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

IE-Bench benchmark with human-rated image editing samples
IE-Critic-R1 model for explainable quality assessment
Reinforcement Learning from Verifiable Rewards for human perception alignment
Bowen Qu
Peking University, Ex: Rhymes.ai Aria Team
Multimodal learning, Vision-Language Models, Computer Vision
Shangkun Sun
School of Electronic and Computer Engineering, Peking University
Xiaoyu Liang
Tsinghua University
CO2 Conversion
Wei Gao
School of Electronic and Computer Engineering, Peking University