FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing diffusion-based image editing methods: because they rely solely on text prompts, they often fail to localize edits precisely, which degrades background consistency. To overcome this, we propose FineEdit, a fine-grained editing framework that uses bounding boxes as guidance and injects them at multiple levels of the model, enabling accurate local modifications while preserving the unedited background. To support this approach, we introduce FineEdit-1.2M, a large-scale dataset of 1.2 million image editing pairs with precise bounding box annotations, along with FineEdit-Bench, a dedicated evaluation benchmark of 1,000 images across 10 subjects. On FineEdit-Bench, FineEdit significantly outperforms open-source models such as Qwen-Image-Edit and LongCat-Image-Edit in both instruction following and background preservation, and it generalizes well to the established GEdit and ImgEdit Bench benchmarks.

📝 Abstract

Diffusion-based image editing models have achieved significant progress in real-world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency because they regenerate the image globally. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we use bounding boxes as guidance to explicitly define the editing target. This approach lets the diffusion model localize the target accurately while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to use spatial conditions more effectively. To support this high-precision guidance, we present FineEdit-1.2M, a large-scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to evaluate region-based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.
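To make the idea of a box-defined editing target and background preservation concrete, here is a minimal sketch. It is not FineEdit's injection method or the metric used in FineEdit-Bench, neither of which is specified in the abstract. The helper names `box_to_mask` and `background_mae`, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the choice of mean absolute error are all assumptions made for illustration.

```python
import numpy as np


def box_to_mask(height, width, box):
    """Return a boolean (height, width) mask that is True inside the box.

    Assumed convention: box is (x0, y0, x1, y1) in pixel coordinates,
    with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


def background_mae(source, edited, box):
    """Mean absolute pixel difference outside the box.

    A value of 0.0 means the background is untouched. This is an
    illustrative metric, not the one reported in the paper. The result
    is NaN if the box covers the whole image.
    """
    mask = box_to_mask(source.shape[0], source.shape[1], box)
    diff = np.abs(source.astype(np.float64) - edited.astype(np.float64))
    # Select only background pixels (those outside the box).
    return float(diff[~mask].mean())


# Toy example: an "edit" that changes only pixels inside the box.
source = np.zeros((64, 64, 3), dtype=np.uint8)
edited = source.copy()
edited[16:32, 8:24] = 255
print(background_mae(source, edited, (8, 16, 24, 32)))  # 0.0
```

An edit that leaks outside the box would produce a positive value, which is the kind of background drift that a globally regenerated image tends to show.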

Problem

Research questions and friction points this paper is trying to address.

image editing

object localization

background consistency

diffusion models

spatial guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

bounding box guidance

fine-grained image editing

diffusion models