Learning an Image Editing Model without Image Editing Pairs

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Image editing models have long relied on large-scale paired input–target datasets, yet high-quality, naturally occurring pairs are scarce. This work introduces a training paradigm that operates entirely without paired data: given natural-language editing instructions, a vision-language model (VLM) supplies fine-grained gradient feedback, while a few-step diffusion model is unrolled during training and regularized with a distribution matching loss (DMD) that constrains generated images to the pretrained diffusion manifold, preserving both semantic consistency and visual fidelity. By eliminating reliance on synthetic or real paired data, the method avoids the error accumulation inherent in data-synthesis pipelines. On standard benchmarks it matches the performance of fully supervised diffusion-based editors and significantly outperforms reinforcement-learning approaches such as Flow-GRPO under few-step inference, all while requiring no ground-truth or synthetic image pairs.

📝 Abstract
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
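The training recipe in the abstract (unroll a few-step editor, score its output with a VLM, regularize with DMD) can be illustrated with a toy sketch. Everything below is a hypothetical simplification, not the paper's implementation: a tiny MLP stands in for the few-step diffusion editor, a differentiable cosine-similarity score stands in for the VLM's instruction-following reward, and a simple mean-matching term stands in for the distribution matching loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D = 16  # toy "image" dimensionality

# Stand-in for a few-step diffusion editor (a small MLP, for illustration only).
editor = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, D))

def rollout(x, steps=4):
    # Unroll the few-step editor: each step refines the current image,
    # keeping the whole chain differentiable end to end.
    for _ in range(steps):
        x = x + 0.25 * editor(x)
    return x

def vlm_score(edited, instruction_dir):
    # Stand-in for VLM feedback: reward alignment of the edit with the
    # instruction, represented here as a hypothetical embedding direction.
    return torch.cosine_similarity(edited, instruction_dir, dim=-1).mean()

def dmd_loss(edited, manifold_batch):
    # Stand-in for distribution matching: pull edited images toward the
    # statistics of samples from a "pretrained manifold".
    return (edited.mean(0) - manifold_batch.mean(0)).pow(2).mean()

opt = torch.optim.Adam(editor.parameters(), lr=1e-2)
inputs = torch.randn(8, D)
instruction_dir = torch.ones(D)      # hypothetical instruction embedding
manifold_batch = torch.randn(64, D)  # stand-in for pretrained-model samples

initial = vlm_score(rollout(inputs), instruction_dir).item()

for step in range(50):
    edited = rollout(inputs)
    # Maximize VLM reward while staying near the pretrained distribution.
    loss = -vlm_score(edited, instruction_dir) + 0.5 * dmd_loss(edited, manifold_batch)
    opt.zero_grad()
    loss.backward()
    opt.step()

final = vlm_score(rollout(inputs), instruction_dir).item()
print(f"VLM alignment: {initial:.3f} -> {final:.3f}")
```

The key point the sketch preserves is that the VLM reward is differentiated directly through the unrolled rollout, rather than used as a scalar reward for policy-gradient updates as in RL baselines like Flow-GRPO.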
Problem

Research questions and friction points this paper is trying to address.

Naturally occurring input–target editing pairs are hard to curate at scale
Synthetic training pairs propagate and magnify artifacts of the pretrained model
Preserving visual fidelity is difficult without ground-truth targets to supervise against
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unrolling a few-step diffusion editor for direct end-to-end optimization without paired data
Differentiable VLM feedback on instruction-following and preservation of unchanged content
Distribution matching loss (DMD) keeping outputs on the pretrained image manifold