🤖 AI Summary
To address low-quality training data in instruction-based image editing, which suffers from poor instruction adherence, detail loss, and prominent artifacts, this paper proposes multi-dimensional reward supervision that removes the reliance on high-fidelity ground-truth edited images. The authors introduce a GPT-4o–based quantitative reward assessment framework that scores edits along three dimensions (instruction following, detail preservation, and generation quality) and construct RewardEdit20K, a large-scale reward dataset with fine-grained textual feedback. They further design a multi-reward conditional training framework that encodes the reward scores and text feedback as embeddings and injects them into both the latent space and the U-Net as auxiliary conditions. On the Real-Edit benchmark, the multi-reward conditioned models outperform their no-reward counterparts on two popular editing pipelines, InsPix2Pix and SmartEdit, indicating that multi-reward conditional modeling substantially enhances editing robustness and fidelity. The code is publicly released.
📝 Abstract
High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E) that are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preservation, and generation artifacts. In this paper, we propose to address the training-data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) We first design a quantitative metric system based on a best-in-class LVLM (Large Vision-Language Model), i.e., GPT-4o in our case, to evaluate generation quality from three perspectives, namely, instruction following, detail preservation, and generation quality. For each perspective, we collect a quantitative score in $0\sim 5$ and descriptive text feedback on the specific failure points of the ground-truth edited image, resulting in a high-quality editing reward dataset, i.e., RewardEdit20K. 2) We further propose a novel training framework that seamlessly integrates the metric outputs, regarded as multi-rewards, into editing models so they can learn from imperfect training triplets. During training, the reward scores and text descriptions are encoded as embeddings and fed into both the latent space and the U-Net of the editing model as auxiliary conditions. 3) We also build a challenging evaluation benchmark with real-world images/photos and diverse editing instructions, named Real-Edit. Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. Code is released at https://github.com/bytedance/Multi-Reward-Editing.
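The conditioning mechanism described in point 2) can be sketched minimally as follows. This is an illustrative PyTorch sketch, not the released implementation: the module name `RewardConditioner`, the layer shapes, and the injection points (a per-channel shift on the latents plus an extra cross-attention token for the U-Net context) are assumptions chosen to mirror the abstract's description of feeding reward embeddings "into both the latent space and the U-Net".

```python
import torch
import torch.nn as nn

class RewardConditioner(nn.Module):
    """Hypothetical sketch: fuse three per-perspective reward scores (0-5)
    with a pooled text-feedback embedding, then condition both the latent
    input and the U-Net's cross-attention context. Names/dims illustrative."""

    def __init__(self, latent_dim=4, text_dim=768, embed_dim=768):
        super().__init__()
        # Project the 3 scalar scores (instruction following,
        # detail preservation, generation quality) into embedding space.
        self.score_proj = nn.Linear(3, embed_dim)
        # Project the text-feedback embedding into the same space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Map the fused reward embedding to the latent channel count so it
        # can be broadcast-added to the noisy latents as an extra condition.
        self.to_latent = nn.Linear(embed_dim, latent_dim)

    def forward(self, latents, context, scores, text_feedback):
        # latents: (B, C, H, W); context: (B, T, D) cross-attention tokens
        # scores: (B, 3) in [0, 5]; text_feedback: (B, text_dim)
        reward_emb = self.score_proj(scores / 5.0) + self.text_proj(text_feedback)
        # Condition the latent space with a per-channel shift.
        latents = latents + self.to_latent(reward_emb)[:, :, None, None]
        # Condition the U-Net by appending a reward token to the context.
        context = torch.cat([context, reward_emb[:, None, :]], dim=1)
        return latents, context

cond = RewardConditioner()
lat = torch.randn(2, 4, 64, 64)               # noisy latents
ctx = torch.randn(2, 77, 768)                  # instruction text tokens
scores = torch.tensor([[5.0, 4.0, 5.0],        # near-perfect triplet
                       [2.0, 1.0, 3.0]])       # flawed triplet
fb = torch.randn(2, 768)                       # pooled feedback embedding
lat2, ctx2 = cond(lat, ctx, scores, fb)
print(lat2.shape, ctx2.shape)  # torch.Size([2, 4, 64, 64]) torch.Size([2, 78, 768])
```

At inference, one would presumably pass the maximum scores (e.g., all 5s) to ask the conditioned model for its highest-quality editing behavior.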