Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing instruction-driven image editing methods suffer from semantic misalignment between synthetically generated training data and natural language instructions, leading to inconsistent editing outcomes. To address this, we propose a contrastive learning-based self-supervised data refinement framework. Our approach is the first to model fine-grained semantic changes before and after editing at each denoising timestep in the diffusion latent space, enabling precise instruction alignment. We further design an instruction-conditioned loss that jointly optimizes cross-modal text–image consistency and editing-direction fidelity. Applying our framework to the InstructPix2Pix dataset, we refine over 120K high-quality instruction–image–edit triplets. Fine-tuning diffusion models on this refined dataset yields substantial improvements in instruction following. The code and refined dataset are publicly released.
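The contrastive objective described in the summary pairs each instruction with the image change it should produce, treating the other pairs in a batch as negatives. As a rough, hypothetical sketch (not the paper's implementation; the embedding inputs are assumed to come from CLIP-style encoders), a symmetric InfoNCE loss over matched instruction and edit-direction embeddings could look like:

```python
import numpy as np

def info_nce(instr_embs, edit_embs, temperature=0.07):
    """Symmetric InfoNCE over matched instruction / edit-direction pairs.

    Row i of each matrix is assumed to be a matched pair; all other
    rows in the batch serve as in-batch negatives.
    """
    instr = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    edit = edit_embs / np.linalg.norm(edit_embs, axis=1, keepdims=True)
    logits = instr @ edit.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the diagonal (matched pair) as the positive class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average over both directions: instruction->edit and edit->instruction.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the matched pairs are well aligned, the diagonal of the similarity matrix dominates and the loss approaches zero; shuffling the pairing drives it up.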

📝 Abstract
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset, obtaining over 120K refined samples that we then use to fine-tune the InstructPix2Pix model, guided by our novel Instruct-CLIP-based loss function. The resulting model produces edits that are more closely aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
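The alignment idea in the abstract can be illustrated per sample: the loss should reward edits whose change direction in embedding space matches the instruction embedding. Below is a minimal NumPy sketch of such a directional term, assuming hypothetical encoders that map images and text into a shared space (the actual Instruct-CLIP networks also condition on noisy latents and timesteps, which this sketch omits):

```python
import numpy as np

def l2_normalize(v):
    # Guard against zero vectors with a small epsilon.
    return v / (np.linalg.norm(v) + 1e-8)

def edit_alignment_loss(img_before, img_after, instr_emb):
    """Score how well an edit direction matches an instruction embedding.

    img_before, img_after: image embeddings from a shared-space encoder
    instr_emb: embedding of the edit instruction text
    Returns 1 - cosine similarity, so 0 means perfect alignment.
    """
    direction = l2_normalize(img_after - img_before)
    instr = l2_normalize(instr_emb)
    return 1.0 - float(direction @ instr)

# Toy example: the edit moves the image exactly along the instruction axis.
before = np.array([1.0, 0.0, 0.0])
after = np.array([1.0, 1.0, 0.0])
instr = np.array([0.0, 1.0, 0.0])
print(round(edit_alignment_loss(before, after, instr), 4))  # → 0.0
```

A term of this shape can be added to a diffusion training objective as a regularizer, penalizing denoising steps whose predicted change contradicts the instruction.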
Problem

Research questions and friction points this paper is trying to address.

Improving instruction-guided image editing quality
Aligning image pairs with edit instructions
Handling noisy latent images in diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised refinement of image-text alignment
Handles noisy latent images and diffusion timesteps
Novel loss function for instruction-guided editing