🤖 AI Summary
Existing instruction-guided image editing benchmarks oversimplify instructions and fail to reflect complex real-world scenarios. To address this, we introduce the first large-scale benchmark for complex instruction-guided image editing, covering four challenging task categories: fine-grained spatial localization, appearance modification, dynamic state changes, and object interaction. We propose an MLLM-human collaborative construction framework and a four-dimensional instruction disentanglement strategy—decoupling location, appearance, dynamics, and objects—to systematically model intricate editing intents. Our pipeline integrates structured instruction parsing, fine-grained scene synthesis, and multi-dimensional human verification. Experiments expose critical bottlenecks in mainstream models' spatial reasoning and contextual understanding. This benchmark establishes a diagnostic evaluation standard and actionable optimization pathways toward next-generation image editing systems that are reasoning-capable, disentangled, and highly controllable.
📝 Abstract
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that require fine-grained instruction following as well as spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions—location, appearance, dynamics, and objects—ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.