SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-driven image editing methods suffer from semantic misalignment because they rely on automatically constructed instruction–image pairs, which yield noisy supervision signals. To address this, we propose a dual-mechanism framework for improving supervision quality. First, we introduce a generation-step-level prior that uniformly guides a vision-language model (VLM) in refining editing instructions, enabling fine-grained semantic alignment. Second, we construct positive–negative instruction triplets and incorporate contrastive learning (requiring no additional pretraining or VLM inference at test time) to enhance instruction discriminability. Our method integrates instruction rewriting, attribute-level analysis, and triplet loss optimization. Evaluated on the Real-Edit benchmark, our approach surpasses the previous state of the art (SmartEdit) by 9.19% while using only 1/30 the training data and 1/13 the model size.

📝 Abstract
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models by generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs), but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, some challenging editing scenarios cannot be resolved with rectified instructions alone. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into model training via a triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals and a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with the previous SOTA, SmartEdit, we achieve a 9.19% improvement on the Real-Edit benchmark with 30x less training data and a 13x smaller model size.
Problem

Research questions and friction points this paper is trying to address.

Addressing noisy supervision in instruction-based image editing datasets
Improving alignment between editing instructions and image pairs
Enhancing supervision with contrastive instructions and triplet loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectify editing instructions for better alignment
Use contrastive supervision with triplet loss
Leverage prior attributes for VLM guidance
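The contrastive supervision described above relies on a standard margin-based triplet loss: pull a representation toward the embedding of the positive (rectified) instruction and push it away from the negative (contrastive) instruction. The sketch below is an illustrative reading of that idea, not the authors' implementation; the embeddings, `triplet_loss` signature, and margin value are all assumptions for demonstration.

```python
def l2_dist(u, v):
    # Euclidean distance between two same-length embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss: zero once the negative instruction
    is farther from the anchor than the positive by at least `margin`."""
    return max(0.0, l2_dist(anchor, positive) - l2_dist(anchor, negative) + margin)

# Toy 3-d embeddings standing in for model features (hypothetical values).
anchor        = [1.0, 0.0, 0.0]  # edited-image feature
positive      = [0.9, 0.1, 0.0]  # rectified ("positive") instruction embedding
negative      = [0.0, 1.0, 0.0]  # clearly mismatched ("negative") instruction
hard_negative = [0.8, 0.2, 0.1]  # near-miss instruction, still inside the margin

easy_loss = triplet_loss(anchor, positive, negative)       # 0.0: already well separated
hard_loss = triplet_loss(anchor, positive, hard_negative)  # positive: gradient would push it away
```

In training, only triplets whose negative falls inside the margin (like `hard_negative` here) contribute gradient, which is what makes carefully constructed contrastive instructions informative.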
👥 Authors

Ming Li: ByteDance Intelligent Creation (USA); Center for Research in Computer Vision, University of Central Florida
Xin Gu: ByteDance Intelligent Creation (USA)
Fan Chen: ByteDance Intelligent Creation (USA)
Xiaoying Xing: Northwestern (computer vision, machine learning, multimodality)
Longyin Wen: Bytedance Inc. (Artificial Intelligence, Computer Vision, Machine Learning)
Chen Chen: Center for Research in Computer Vision, University of Central Florida
Sijie Zhu: unknown affiliation