🤖 AI Summary
Current benchmarks for visually-guided image editing oversimplify evaluation and neglect hallucination, i.e., the erroneous interpretation of visual cues that leads to incorrect edits in realistic scenarios. To address this, we propose SpotEdit, the first systematic benchmark for image editing jointly guided by visual cues and textual prompts, covering diffusion, autoregressive, and hybrid generative models. Its contributions are threefold: (1) the first hallucination assessment module, which quantifies how state-of-the-art multimodal models (e.g., GPT-4o) misinterpret visual cues during editing; (2) a fine-grained test suite requiring complex semantic and spatial reasoning, integrating multimodal localization evaluation with both human and automatic metrics; and (3) empirical evidence that hallucination is pervasive and previously underestimated across existing methods, with performance gaps far larger than prior evaluations suggested. SpotEdit's code and data are publicly released, establishing foundational infrastructure for robust, hallucination-aware evaluation standards in image editing.
📝 Abstract
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.