Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of consistency across semantic scales in current image editing models, which limits their reliability assessment in professional settings. To this end, we introduce Omni IIE Bench, a high-quality human-annotated benchmark, and propose the first dual-track diagnostic framework specifically designed to evaluate performance under varying semantic granularities. Our approach employs single-turn consistency and multi-turn coordination tasks, validated through a dual-review mechanism involving both professional designers and computer vision researchers, enabling precise identification of model degradation. Experiments quantitatively reveal, for the first time, a significant performance drop in mainstream instruction-based image editing models when transitioning from low- to high-semantic-scale edits, offering critical diagnostic insights for developing more robust and reliable next-generation editing systems.

📝 Abstract
While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.
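The abstract's central quantitative claim is a performance gap between low- and high-semantic-scale tasks. As a minimal illustrative sketch (not the paper's actual evaluation protocol), one way to quantify such a gap is to average per-task scores within each semantic scale and take the difference; the scores, scale labels, and `scale_gap` function below are all hypothetical:

```python
# Hypothetical sketch of quantifying a semantic-scale performance gap.
# The scores, scale labels, and gap metric are illustrative assumptions,
# not the benchmark's actual scoring protocol.
from statistics import mean


def scale_gap(results):
    """results: iterable of (semantic_scale, score) pairs,
    with semantic_scale in {'low', 'high'}.
    Returns mean(low) - mean(high); positive => degradation on
    high-semantic-scale edits."""
    low = [score for scale, score in results if scale == "low"]
    high = [score for scale, score in results if scale == "high"]
    return mean(low) - mean(high)


# Toy per-task scores for one model (made-up numbers):
results = [
    ("low", 0.82), ("low", 0.78),    # e.g. attribute modification
    ("high", 0.61), ("high", 0.55),  # e.g. entity replacement
]
gap = scale_gap(results)  # positive gap => the drop the paper reports
```

A consistent model would show a gap near zero across both the single-turn pairs and the multi-turn dialogues; the paper's finding is that nearly all evaluated models show a clearly positive gap.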
Problem

Research questions and friction points this paper is trying to address.

Instruction-based Image Editing, editing consistency, semantic scale, benchmark, performance gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-based image editing, editing consistency, semantic scale, benchmark design, multi-turn coordination
Yujia Yang
University of Chinese Academy of Sciences
Yuanxiang Wang
University of Chinese Academy of Sciences
Zhenyu Guan
University of Chinese Academy of Sciences
Tiankun Yang
University of Chinese Academy of Sciences
Chenxi Bao
MBZUAI
Music Generation, Interactive Music Design, Computer Music
Haopeng Jin
Tencent
Jinwen Luo
Tencent
Xinyu Zuo
Tencent
Lisheng Duan
Tencent
Haijin Liang
Tencent
Jin Ma
Tencent
Xinming Wang
Tencent
Ruiwen Tao
Tencent
Hongzhu Yi
University of Chinese Academy of Sciences