CompBench: Benchmarking Complex Instruction-guided Image Editing

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-guided image editing benchmarks oversimplify instructions, failing to reflect complex real-world scenarios. To address this, we introduce the first large-scale benchmark for complex instruction-guided image editing, covering four challenging task categories: fine-grained spatial localization, appearance modification, dynamic state changes, and object interaction. We propose an MLLM-human collaborative construction framework and a novel four-dimensional instruction disentanglement strategy—decoupling position, appearance, dynamics, and objects—to systematically model intricate editing intents. Our pipeline integrates structured instruction parsing, fine-grained scene synthesis, and multi-dimensional human verification. Experiments expose critical bottlenecks in mainstream models' spatial reasoning and contextual understanding. This benchmark establishes a diagnostic evaluation standard and actionable optimization pathways toward next-generation image editing systems that are reasoning-capable, disentangled, and highly controllable.

📝 Abstract
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that require fine-grained instruction following as well as spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack complex instruction-guided image editing tasks
Need for fine-grained instruction following and spatial reasoning in editing
Current models struggle with precise manipulation of complex editing scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-human collaborative framework for editing
Instruction decoupling into four key dimensions
Benchmark for complex instruction-guided editing
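The four-dimensional instruction decoupling above can be pictured as a simple record that splits one editing intent into the paper's four dimensions (location, appearance, dynamics, objects). This is a minimal illustrative sketch: the field semantics and the recomposition logic are assumptions for exposition, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class DecoupledInstruction:
    # The four dimensions named in the paper; field semantics are assumed.
    location: str    # where in the image the edit applies
    appearance: str  # color, texture, or style change
    dynamics: str    # state or motion change
    objects: str     # which entities are involved

    def to_instruction(self) -> str:
        # Recompose a flat natural-language edit instruction (illustrative).
        return (f"Edit the {self.objects} {self.location}: "
                f"{self.appearance}; {self.dynamics}.")

edit = DecoupledInstruction(
    location="in the lower-left corner",
    appearance="repaint it bright red",
    dynamics="make it appear in mid-jump",
    objects="small dog",
)
print(edit.to_instruction())
```

Keeping the four dimensions as separate fields is what makes disentangled evaluation possible: each dimension can be varied or scored independently before being flattened into a single instruction string.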
👥 Authors
Bohan Jia (East China Normal University): MLLM, LLM, AIGC
Wenxuan Huang (CUHK & ECNU): Artificial General Intelligence, MLLM, LLM, AIGC, Model Acceleration
Yuntian Tang (East China Normal University)
Junbo Qiao (East China Normal University)
Jincheng Liao (ECNU): MLLM
Shaosheng Cao (Xiaohongshu, DiDi Chuxing, Ant Financial, Microsoft Research): LLMs, Multimodal LLMs, Reinforcement Learning, NLP, Graph Neural Networks
Fei Zhao (Xiaohongshu Inc.)
Zhaopeng Feng (Zhejiang University)
Zhouhong Gu (Fudan University): Language Modeling, Automated Society, Model Editing
Zhenfei Yin (University of Oxford): Deep Learning, Multimodal, AI Agent, Robotics
Lei Bai (Shanghai AI Laboratory): Foundation Model, Science Intelligence, Multi-Agent System, Autonomous Discovery
Wanli Ouyang (The Chinese University of Hong Kong)
Lin Chen (University of Science and Technology of China)
Zihan Wang (East China Normal University)
Yuan Xie (East China Normal University)
Shaohui Lin (East China Normal University)