CompBench: Benchmarking Complex Instruction-guided Image Editing

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-guided image editing benchmarks oversimplify instructions, failing to reflect complex real-world scenarios. To address this, we introduce the first large-scale benchmark for complex instruction-guided image editing, covering four challenging task categories: fine-grained spatial localization, appearance modification, dynamic state changes, and object interaction. We propose an MLLM-human collaborative construction framework and a novel four-dimensional instruction disentanglement strategy—decoupling position, appearance, dynamics, and objects—to systematically model intricate editing intents. Our pipeline integrates structured instruction parsing, fine-grained scene synthesis, and multi-dimensional human verification. Experiments expose critical bottlenecks in mainstream models' spatial reasoning and contextual understanding. This benchmark establishes a diagnostic evaluation standard and actionable optimization pathways toward next-generation image editing systems that are reasoning-capable, disentangled, and highly controllable.

📝 Abstract
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that require fine-grained instruction following as well as spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack complex instruction-guided image editing tasks
Need for fine-grained instruction following and spatial reasoning in editing
Current models struggle with precise manipulation of complex editing scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-human collaborative framework for editing
Instruction decoupling into four key dimensions
Benchmark for complex instruction-guided editing
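The four-dimensional instruction decoupling above can be pictured as a simple record that splits one editing intent into the paper's four dimensions (location, appearance, dynamics, objects). This is a minimal illustrative sketch: the field semantics and the recomposition logic are assumptions for exposition, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class DecoupledInstruction:
    # The four dimensions named in the paper; field semantics are assumed.
    location: str    # where in the image the edit applies
    appearance: str  # color, texture, or style change
    dynamics: str    # state or motion change
    objects: str     # which entities are involved

    def to_instruction(self) -> str:
        # Recompose a flat natural-language edit instruction (illustrative).
        return (f"Edit the {self.objects} {self.location}: "
                f"{self.appearance}; {self.dynamics}.")

edit = DecoupledInstruction(
    location="in the lower-left corner",
    appearance="repaint it bright red",
    dynamics="make it appear in mid-jump",
    objects="small dog",
)
print(edit.to_instruction())
```

Keeping the four dimensions as separate fields is what makes disentangled evaluation possible: each dimension can be varied or scored independently before being flattened into a single instruction string.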
👥 Authors
Bohan Jia (East China Normal University): MLLM, LLM, AIGC
Wenxuan Huang (CUHK & ECNU): Artificial General Intelligence, MLLM, LLM, AIGC, Model Acceleration
Yuntian Tang (East China Normal University)
Junbo Qiao (East China Normal University)
Jincheng Liao (ECNU): MLLM
Shaosheng Cao (Xiaohongshu, DiDi Chuxing, Ant Financial, Microsoft Research): LLMs, Multimodal LLMs, Reinforcement Learning, NLP, Graph Neural Networks
Fei Zhao (Xiaohongshu Inc.)
Zhaopeng Feng (Zhejiang University)
Zhouhong Gu (Fudan University): Language Modeling, Automated Society, Model Editing
Zhenfei Yin (University of Oxford): Deep Learning, Multimodal, AI Agent, Robotics
Lei Bai (Shanghai AI Laboratory): Foundation Model, Science Intelligence, Multi-Agent System, Autonomous Discovery
Wanli Ouyang (The Chinese University of Hong Kong)
Lin Chen (University of Science and Technology of China)
Zihan Wang (East China Normal University)
Yuan Xie (East China Normal University)
Shaohui Lin (East China Normal University)