🤖 AI Summary
Existing image editing benchmarks focus primarily on explicit, literal instructions, so they cannot adequately evaluate how well models interpret abstract intents such as “mood” or “atmosphere.” This work formalizes the definition and taxonomy of abstract image editing and introduces AbstractEdit, the first benchmark dedicated to this setting across diverse real-world scenes. It also proposes Entity-Rubrics, a fine-grained evaluation framework that breaks abstract edits down into individual, entity-level assessments and correlates strongly with human judgment. The same entity-based paradigm could serve as a reward model or highlight specific failures in test-time critique loops. Experiments on eleven leading models show that they struggle to balance intent and preservation, commonly under-editing or over-editing, and that meaningful improvements depend heavily on advanced LLM text encoders and iterative thinking.
📝 Abstract
Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.
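To make the idea of entity-level assessment concrete, the following is a minimal illustrative sketch, not the paper's Entity-Rubrics implementation. It assumes a hypothetical rubric in which each entity in the image is marked as either an edit target or something to preserve, and a hypothetical judge has already decided whether each entity was visibly changed. The names `EntityCheck`, `score_edit`, and the `threshold` parameter are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class EntityCheck:
    entity: str          # e.g. "sky" (hypothetical entity label)
    should_change: bool  # assumed rubric: does the abstract intent call for editing it?
    changed: bool        # assumed judge verdict: was the entity visibly edited?


def score_edit(checks, threshold=0.75):
    """Aggregate per-entity verdicts into (intent, preservation, issues).

    intent:       fraction of target entities that were actually changed
    preservation: fraction of non-target entities that were left alone
    issues:       labels for scores that fall below the threshold
    """
    targets = [c for c in checks if c.should_change]
    keepers = [c for c in checks if not c.should_change]

    # An empty group counts as fully satisfied to avoid division by zero.
    intent = sum(c.changed for c in targets) / len(targets) if targets else 1.0
    preservation = (
        sum(not c.changed for c in keepers) / len(keepers) if keepers else 1.0
    )

    issues = []
    if intent < threshold:
        issues.append("under-editing")
    if preservation < threshold:
        issues.append("over-editing")
    return intent, preservation, issues


# Hypothetical instruction: "make the scene feel more melancholic".
example = [
    EntityCheck("sky", should_change=True, changed=True),
    EntityCheck("lighting", should_change=True, changed=False),
    EntityCheck("person", should_change=False, changed=False),
    EntityCheck("building", should_change=False, changed=False),
]
result = score_edit(example)
print(result)
```

In this toy case only one of the two target entities was edited while both preserved entities were untouched, so the sketch flags under-editing. Keeping intent and preservation as separate scores mirrors the tension the abstract describes between following the abstract intent and preserving the rest of the image.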