🤖 AI Summary
This work addresses the lack of systematic evaluation of consistency across semantic scales in current image editing models, which limits reliability assessment in professional settings. To this end, we introduce OmniIIE-Bench, a high-quality, human-annotated benchmark, and propose the first dual-track diagnostic framework specifically designed to evaluate performance under varying semantic granularities. Our approach employs single-turn consistency and multi-turn coordination tasks, validated through a dual-review mechanism involving both professional designers and computer vision researchers, enabling precise identification of model degradation. Experiments quantitatively reveal, for the first time, a significant performance drop in mainstream instruction-based image editing models when transitioning from low- to high-semantic-scale edits, offering critical diagnostic insights for developing more robust and reliable next-generation editing systems.
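The reported cross-scale degradation can be summarized as a simple gap statistic. The sketch below is a minimal illustration, assuming per-task scores normalized to [0, 1]; the function name, score keys, and all numbers are hypothetical assumptions for exposition, not the paper's metric or results.

```python
from statistics import mean

def semantic_scale_gap(scores: dict[str, list[float]]) -> float:
    """Hypothetical diagnostic: mean score on low-semantic-scale tasks
    (e.g., attribute modification) minus mean score on high-semantic-scale
    tasks (e.g., entity replacement). A large positive gap indicates the
    kind of degradation OmniIIE-Bench is designed to expose."""
    return mean(scores["low_scale"]) - mean(scores["high_scale"])

# Illustrative numbers only, not results from the paper.
example = {"low_scale": [0.82, 0.79, 0.88], "high_scale": [0.61, 0.55, 0.68]}
print(f"semantic-scale gap: {semantic_scale_gap(example):.2f}")  # -> 0.22
```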
📝 Abstract
While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a failure mode critical in professional applications: inconsistent model performance across tasks of varying semantic scales. To address this gap, we introduce OmniIIE-Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. OmniIIE-Bench features a dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via a rigorous multi-stage human filtering process, incorporating quality control by computer vision graduate students and an industry-relevance review by professional designers. We conduct a comprehensive evaluation of 8 mainstream IIE models on OmniIIE-Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. OmniIIE-Bench provides critical diagnostic tools and insights for developing the next generation of more reliable and stable IIE models.
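To make the dual-track design concrete, the following is a minimal sketch of one plausible way to represent items from each track; every class and field name here is an illustrative assumption, not the benchmark's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class SingleTurnPair:
    """Track 1 (Single-turn Consistency): a shared-context task pair over
    the same source image, coupling a low-semantic-scale edit (attribute
    modification) with a high-semantic-scale edit (entity replacement)."""
    image_id: str
    attribute_instruction: str    # e.g., "make the jacket blue"
    replacement_instruction: str  # e.g., "replace the jacket with a raincoat"

@dataclass
class MultiTurnDialogue:
    """Track 2 (Multi-turn Coordination): a continuous dialogue whose
    successive turns traverse semantic scales on one source image."""
    image_id: str
    turns: list[str] = field(default_factory=list)

# Illustrative items only; not taken from the benchmark.
pair = SingleTurnPair("img_0001", "make the car red",
                      "replace the car with a bicycle")
dialogue = MultiTurnDialogue("img_0002", ["brighten the sky", "add a kite",
                                          "replace the beach with a meadow"])
```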