$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image editing models lack evaluation benchmarks with controllable instruction complexity. Method: We introduce Complex-Edit, the first benchmark enabling graded control over instruction complexity. Instructions are generated with GPT-4o via a "Chain-of-Edit" pipeline that first synthesizes atomic editing tasks and then composes them into complex compositional instructions; we propose a quantifiable, tunable definition of complexity with a matching evaluation paradigm; and we design a fully automated, VLM-driven assessment pipeline that incorporates Best-of-N sampling for robustness. Results: Experiments uncover a "curse of synthetic data," demonstrate a steep performance drop for open-source models under high-complexity edits, and reveal that step-by-step editing significantly degrades both fidelity and aesthetic quality. The framework provides a reproducible, large-scale, multi-dimensional evaluation suite, measuring semantic consistency, structural fidelity, and visual quality, and establishes a systematic standard for assessing instruction-driven image editing.

📝 Abstract
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
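The Chain-of-Edit idea from the abstract -- generate atomic editing tasks independently, then integrate them into one complex instruction -- can be sketched as a simple composition step. This is an illustrative sketch only: the function name and the joining scheme are assumptions, not the paper's actual pipeline, which uses GPT-4o for both generation and integration.

```python
# Hypothetical sketch of the "Chain-of-Edit" composition step: atomic
# edits are produced first, then merged into a single compound
# instruction whose complexity is the number of atomic edits it folds in.

def compose_instructions(atomic_edits):
    """Merge independent atomic edits into one compound instruction."""
    if not atomic_edits:
        return ""
    return "; then ".join(atomic_edits)

atomic = [
    "replace the sky with a sunset",
    "add a red umbrella to the person on the left",
    "convert the scene to watercolor style",
]

complex_instruction = compose_instructions(atomic)
complexity = len(atomic)  # instruction complexity level
print(complexity, complex_instruction)
```

In the paper the integration is done by GPT-4o rather than string joining, so the composed instruction reads as natural language instead of a semicolon-separated list; the sketch only shows where the complexity knob sits.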
Problem

Research questions and friction points this paper is trying to address.

How do instruction-based image editing models perform as instruction complexity varies?
How large is the performance gap between open-source and proprietary models?
How does synthetic training data affect the quality of edited images?
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4o for diverse instruction collection
Chain-of-Edit pipeline for complex instructions
VLM-based auto-evaluation for large-scale assessment
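The Best-of-N strategy mentioned above can be sketched as follows, assuming a scoring function (in the paper, a VLM judge) that maps each candidate edit to a scalar quality score. `best_of_n` and `mock_scores` are illustrative stand-ins, not the paper's actual evaluator.

```python
# Minimal sketch of Best-of-N selection: sample N candidate edits,
# score each with a judge, and keep the highest-scoring one.

def best_of_n(candidates, score_fn):
    """Return the highest-scoring candidate among N sampled edits."""
    return max(candidates, key=score_fn)

# Toy stand-in for a VLM judge: precomputed scores per candidate.
mock_scores = {"edit_a": 0.62, "edit_b": 0.81, "edit_c": 0.55}

best = best_of_n(list(mock_scores), mock_scores.get)
print(best)  # edit_b
```

The same selection applies to both evaluation settings the paper compares: direct editing (score N full edits) and step-by-step editing (score N candidates at each atomic step).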