🤖 AI Summary
Existing evaluation frameworks for image editing suffer from limited task coverage and inadequate metrics for assessing identity, structural, and semantic consistency in edited images. To address these shortcomings, this work introduces GEditBench v2, a comprehensive benchmark comprising 1,200 real user queries spanning 23 editing tasks, including an open-set category for out-of-distribution instructions. The authors further propose PVC-Judge, a pairwise visual consistency evaluation model trained via region-decoupled preference data synthesis, and construct VCReward-Bench from expert-annotated preference pairs to measure its alignment with human judgments. Experimental results demonstrate that PVC-Judge achieves state-of-the-art performance among open-source models, surpassing GPT-5.1 on average. A systematic evaluation of 16 frontier image editing models reveals critical limitations of current approaches, offering both a reliable assessment tool and actionable directions for advancing precise image editing.
📝 Abstract
Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models and providing a reliable foundation for advancing precise image editing.
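To make the pairwise visual consistency setup concrete, here is a minimal sketch of what such a judge's interface might look like. The paper does not publish its API here, so every name in this snippet (`EditPair`, `pairwise_consistency_judge`, the `vlm` callable) is hypothetical; the actual PVC-Judge model, prompt, and criteria may differ.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditPair:
    original: str      # path to the source image
    instruction: str   # the editing instruction given to both models
    candidate_a: str   # path to edited image A
    candidate_b: str   # path to edited image B

def pairwise_consistency_judge(
    pair: EditPair,
    vlm: Callable[[str, list[str]], str],
) -> str:
    """Ask a vision-language model to pick the edit that better preserves
    the original's identity, structure, and semantics while following the
    instruction. Returns 'A', 'B', or 'TIE'. The `vlm` argument is any
    callable taking (prompt, image_paths) and returning a text answer."""
    prompt = (
        "You are given an original image and two edited versions (A, B).\n"
        f"Instruction: {pair.instruction}\n"
        "Judge which edit better preserves the original image's identity, "
        "structure, and semantic coherence in regions the instruction does "
        "not target, while still applying the requested change. "
        "Answer with exactly one of: A, B, TIE."
    )
    answer = vlm(prompt, [pair.original, pair.candidate_a, pair.candidate_b])
    verdict = answer.strip().upper()
    # Fall back to a tie if the model returns anything unexpected.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Comparing two candidate edits against the same original, rather than scoring each edit in isolation, is what the abstract's "pairwise" framing refers to: relative judgments of this kind tend to be easier to align with human preference annotations such as those in VCReward-Bench.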