GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenges of semantic ambiguity and lack of fine-grained evaluation in benchmarking text-guided image editing models, introducing the first grounded evaluation benchmark. Methodologically, it proposes a dual-dimension assessment framework: (1) functional correctness is evaluated via task-driven, automatically generated multiple-choice questions to measure instruction adherence; (2) content fidelity is quantified using object-aware spatial masks and CLIP-guided local consistency scoring—mitigating the semantic coarseness of global CLIP similarity. Key contributions include the first automatic question-answering verification mechanism and object-level fidelity quantification, uncovering a fundamental trade-off between instruction accuracy and unintended modifications to non-target regions. Evaluated on 1,000+ samples across 20 content categories, the benchmark achieves high correlation with human judgments (Spearman’s ρ > 0.87). While GPT-Image-1 attains the highest instruction accuracy, it exhibits significantly lower fidelity compared to other models.
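The reported agreement with human judgments is measured with Spearman's rank correlation. As an illustration of how such a validation is computed, here is a minimal pure-Python sketch of the tie-free Spearman formula applied to hypothetical metric outputs and human ratings (the numbers are made up, not from the paper):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation, simplified formula (assumes no tied values)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # rank 1 = smallest value
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical automatic metric scores vs. human 1-5 ratings
auto = [0.91, 0.73, 0.55, 0.88, 0.42]
human = [5, 4, 2, 3, 1]
# spearman_rho(auto, human) → 0.9
```

A ρ close to 1 means the automatic metric ranks edits in nearly the same order as human raters, which is what the benchmark's ρ > 0.87 result indicates.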

📝 Abstract
Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
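The functional-correctness dimension described above scores an edited image by how many auto-generated multiple-choice questions a vision-language model answers as expected. A minimal sketch of that scoring loop follows; `vqa_answer` and the question-dictionary fields are hypothetical names standing in for whatever VQA model and question format the benchmark actually uses:

```python
def functional_correctness(edited_image, questions, vqa_answer):
    """Fraction of multiple-choice questions answered as expected.

    edited_image: the image produced by the editing model.
    questions: list of dicts with "question", "choices", and "expected" keys
               (hypothetical schema for illustration).
    vqa_answer: hypothetical callable (image, question, choices) -> chosen answer.
    """
    correct = sum(
        vqa_answer(edited_image, q["question"], q["choices"]) == q["expected"]
        for q in questions
    )
    return correct / len(questions)

# Toy example with a stub "VQA model" that always picks the first choice
questions = [
    {"question": "What color is the car?", "choices": ["red", "blue"], "expected": "red"},
    {"question": "Is the hat still present?", "choices": ["yes", "no"], "expected": "no"},
]
stub_vqa = lambda image, question, choices: choices[0]
score = functional_correctness(None, questions, stub_vqa)  # 1 of 2 correct -> 0.5
```

The stub only demonstrates the bookkeeping; in the benchmark the answers would come from a real VQA model run on the edited image.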
Problem

Research questions and friction points this paper is trying to address.

Existing metrics for text-guided image editing (e.g., global CLIP similarity) lack precision
Assessing functional correctness and content preservation in image edits
Developing a scalable benchmark for accurate model comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces benchmark for text-guided image editing evaluation
Uses multiple-choice questions for functional correctness
Employs object-aware masking for content preservation
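The object-aware masking idea in the last bullet can be sketched as follows: given a spatial mask over the region the instruction targets, preservation is scored only on the complement of that mask. This is a minimal NumPy illustration of the general idea (a simple pixel-difference score, not the paper's CLIP-guided metric):

```python
import numpy as np

def preservation_score(original, edited, target_mask):
    """Score how well non-target regions are preserved after an edit.

    original, edited: float arrays of shape (H, W, C) with values in [0, 1].
    target_mask: bool array of shape (H, W); True where the edit was
    supposed to happen. Only the complement region is scored.
    """
    keep = ~target_mask                      # pixels that should stay untouched
    if not keep.any():
        return 1.0                           # the whole image was the edit target
    diff = np.abs(original - edited)[keep]   # per-pixel, per-channel error outside the mask
    return float(1.0 - diff.mean())          # 1.0 = non-target regions perfectly preserved

# A model that only changes the masked pixel scores a perfect 1.0
orig = np.full((4, 4, 3), 0.5)
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
edit = orig.copy()
edit[0, 0] = 1.0
preservation_score(orig, edit, mask)  # → 1.0
```

Restricting the comparison to unmasked pixels is what lets the benchmark penalize over-modification of irrelevant regions without penalizing the intended edit itself.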