Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instruction-driven image editing models suffer from overfitting to annotation patterns during supervised fine-tuning, resulting in limited generalization; moreover, a universal, transferable reward model for evaluating editing quality remains lacking. Method: We propose Edit-R1, the first framework to employ a training-free multimodal large language model (MLLM) as a unified, zero-shot reward model, integrated with Diffusion Negative-aware Finetuning (DiffusionNFT) and a low-variance group filtering mechanism for stable and efficient policy optimization. Because DiffusionNFT is likelihood-free, the framework supports higher-order samplers and MLLM implicit feedback, enabling plug-and-play upgrades of foundation models. Contribution/Results: UniWorld-V2, trained with Edit-R1, achieves state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench, and the framework significantly improves both the generalization and the editing fidelity of base models including Qwen-Image-Edit and FLUX-Kontext.

📝 Abstract
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
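The abstract's logit-based reward can be illustrated with a minimal sketch. One common way to turn an MLLM judge into a fine-grained scorer is to prompt it with a binary question ("Does the edited image follow the instruction?") and, instead of sampling a token, convert the raw logits of the "yes" and "no" answer tokens into a soft score via a two-way softmax. The function below is a hypothetical illustration of that idea, not the paper's exact reward computation.

```python
import math

def mllm_logit_reward(yes_logit: float, no_logit: float) -> float:
    """Return P('yes') under a two-way softmax over the judge's
    'yes'/'no' answer-token logits, giving a reward in [0, 1].

    Using logits rather than a sampled token yields a continuous,
    fine-grained signal instead of a hard 0/1 judgment.
    """
    m = max(yes_logit, no_logit)          # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

For equal logits the score is exactly 0.5, and it rises smoothly toward 1.0 as the "yes" logit dominates, which is what makes the feedback usable as a dense reward for policy optimization.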
Problem

Research questions and friction points this paper is trying to address.

Addresses overfitting in supervised image editing models
Introduces policy optimization for instruction-based image editing
Uses MLLM as training-free reward for diverse editing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Negative-aware Finetuning for likelihood-free policy optimization
MLLM as training-free reward model for fine-grained feedback
Low-variance group filtering mechanism to stabilize optimization
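One plausible reading of the group filtering mechanism above: in group-based policy optimization, each prompt yields a group of candidate edits scored by the MLLM, and groups whose rewards are nearly identical carry no usable advantage signal while amplifying scorer noise. The sketch below drops such groups before the policy update; the function name and threshold are illustrative assumptions, not the paper's exact mechanism.

```python
from statistics import pstdev

def filter_groups(groups: list[list[float]], min_std: float = 0.05) -> list[int]:
    """Return indices of reward groups with enough within-group spread.

    Groups where all candidates receive (almost) the same reward
    contribute no relative preference between samples, so they are
    filtered out to stabilize optimization.
    """
    return [i for i, rewards in enumerate(groups) if pstdev(rewards) >= min_std]
```

For example, a group scored [0.9, 0.9, 0.9] is discarded, while [0.2, 0.8, 0.5] is kept, since only the latter distinguishes better edits from worse ones.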
Zongjian Li
Shenzhen Graduate School, Peking University
Zheyuan Liu
Shenzhen Graduate School, Peking University
Qihui Zhang
Peking University
Human Alignment, Multi-Modality, Large Language Model
Bin Lin
Shenzhen Graduate School, Peking University
Shenghai Yuan
Shenzhen Graduate School, Peking University
Zhiyuan Yan
Shenzhen Graduate School, Peking University
Yang Ye
Shenzhen Graduate School, Peking University
Wangbo Yu
Peking University
3D Vision, AIGC
Yuwei Niu
Chongqing University
Visual Representations, Language Priors
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic Resistance, Wastewater Treatment, Environmental Bioremediation, Anaerobic Digestion, Fate of Organic Pollutants