🤖 AI Summary
Current benchmarks for visually-guided image editing oversimplify evaluation and neglect hallucination, i.e., the erroneous interpretation of visual cues that leads to incorrect edits in realistic scenarios. To address this, we propose SpotEdit, the first systematic benchmark for image editing jointly guided by visual cues and textual prompts, covering diffusion, autoregressive, and hybrid generative models. Its contributions are threefold: (1) the first hallucination assessment module, which quantifies how state-of-the-art multimodal models (e.g., GPT-4o) misinterpret visual cues during editing; (2) a fine-grained test suite requiring complex semantic and spatial reasoning, integrating multimodal localization evaluation with both human and automatic metrics; and (3) empirical evidence that hallucination is pervasive and previously underestimated across existing methods, with performance gaps far larger than prior evaluations suggested. SpotEdit's code and data are publicly released, establishing foundational infrastructure for robust, hallucination-aware evaluation standards in image editing.
📝 Abstract
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.