🤖 AI Summary
Current instruction-based image editing methods exhibit significant performance degradation on complex scenes involving multiple entities, particularly struggling with referring-expression understanding. This work identifies systematic deficiencies of mainstream models on multi-entity referring editing tasks. To address them, the authors introduce RefEdit-Bench, the first high-quality benchmark oriented toward referring expressions, and propose RefEdit, a lightweight and efficient editing model. RefEdit is trained on a scalable synthetic data generation pipeline built on RefCOCO annotations; with only 20K training samples, it outperforms Flux/SD3-based baselines trained on millions of examples. On RefEdit-Bench, RefEdit substantially surpasses leading closed- and open-source approaches, and it also attains state-of-the-art results on conventional image editing benchmarks. All code, data, and model weights are publicly released.
📝 Abstract
Despite recent advances in inversion- and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, on which even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit -- an instruction-based editing model trained on data from our scalable synthetic generation pipeline. Trained on only 20,000 editing triplets, RefEdit outperforms Flux/SD3-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels at referring-expression tasks but also improves performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release our data and checkpoints for reproducibility.