SpotEdit: Evaluating Visually-Guided Image Editing Methods

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current benchmarks for visually-guided image editing oversimplify evaluation and neglect hallucination, i.e., the erroneous interpretation of visual cues that leads to incorrect edits in realistic scenarios. To address this, the paper introduces SpotEdit, a systematic benchmark for image editing jointly guided by visual cues and textual prompts, covering diffusion, autoregressive, and hybrid generative models. Its contributions are threefold: (1) a dedicated hallucination assessment component that quantifies how state-of-the-art multimodal models (e.g., GPT-4o) misinterpret visual cues during editing; (2) a fine-grained test suite requiring complex semantic and spatial reasoning, evaluated with both human and automatic metrics; and (3) empirical evidence of pervasive, previously underestimated hallucination and substantial performance disparities across existing methods. SpotEdit's code and data are publicly released, providing infrastructure for robust, hallucination-aware evaluation in image editing.
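As a rough illustration of the kind of measurement such a hallucination component involves (not the paper's actual protocol), a hallucination rate can be computed as the fraction of cue-absent cases where a model performs the edit anyway. The sample schema and function below are assumptions for the sketch:

```python
# Hypothetical sketch of a hallucination-rate metric in the spirit of
# SpotEdit's hallucination component; the field names ("cue_present",
# "edit_performed") are illustrative, not the benchmark's actual schema.

def hallucination_rate(samples):
    """Fraction of cue-absent samples where the model edited anyway."""
    absent = [s for s in samples if not s["cue_present"]]
    if not absent:
        return 0.0
    hallucinated = sum(1 for s in absent if s["edit_performed"])
    return hallucinated / len(absent)

samples = [
    {"cue_present": True,  "edit_performed": True},   # correct edit
    {"cue_present": False, "edit_performed": True},   # hallucinated edit
    {"cue_present": False, "edit_performed": False},  # correctly abstained
]
print(hallucination_rate(samples))  # 0.5
```

A lower rate means the model more often abstains when the referenced visual cue is absent from the input.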


📝 Abstract
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
Problem

Research questions and friction points this paper is trying to address.

Evaluating visually-guided image editing methods comprehensively
Assessing performance disparities across diverse generative models
Addressing hallucination issues in visual cue interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for visually-guided editing evaluation
Systematic assessment across diverse generative model types
Dedicated hallucination analysis component for editing methods