DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-driven image editing models lack systematic evaluation of their capability to perform fine-grained edits on small-scale objects, defined as objects occupying 1%-10% of the image area. To address this gap, this work introduces DLEBench, the first benchmark specifically designed for small-object editing, comprising 1,889 samples across seven instruction categories. The authors propose a dual-mode evaluation framework (combining Tool-driven and Oracle-guided assessment) along with fine-grained scoring criteria that substantially reduce subjectivity and better align LMM-as-a-Judge outputs with human judgments. Experiments across ten state-of-the-art models reveal significant performance deficiencies in small-object editing tasks, demonstrating the necessity and effectiveness of the proposed benchmark.

📝 Abstract
Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1,889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgments on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
Problem

Research questions and friction points this paper is trying to address.

small-scale object editing
instruction-based image editing
benchmark evaluation
visual consistency
instruction following
Innovation

Methods, ideas, or system contributions that make the work stand out.

small-scale object editing
instruction-based image editing
benchmark
evaluation protocol
dual-mode evaluation