EditCLIP: Representation Learning for Image Editing

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in image editing: the lack of representations that model the editing transformation itself, and the weak correlation between automated evaluation metrics and human judgments. The authors propose EditCLIP, a CLIP-based encoder that learns a unified representation of an edit by jointly encoding an input image and its edited counterpart. The resulting embedding space supports two applications. For exemplar-based editing, EditCLIP embeddings computed from a reference image pair replace the text instructions in InstructPix2Pix, allowing an edit demonstrated on one image to be transferred to another. For automated evaluation, edit quality is scored by the similarity between a pair's EditCLIP embedding and either a textual instruction or the embedding of a reference pair. Experiments show the approach outperforms state-of-the-art exemplar-based editing methods while being more efficient and versatile, and that its evaluation metric aligns more closely with human ratings than existing CLIP-based metrics.

📝 Abstract
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
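The evaluation use described above reduces to a similarity comparison in the learned embedding space: encode a (before, after) pair, then measure cosine similarity against a reference edit embedding. The sketch below illustrates only that interface; the real EditCLIP pair encoder is a learned CLIP-style network, which is replaced here by a toy normalized-pixel-difference stand-in, and `edit_embedding` / `edit_score` are hypothetical names, not the paper's API.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def edit_embedding(src, edited):
    """Toy stand-in for the EditCLIP pair encoder.

    The real model jointly encodes (src, edited) with a learned network;
    here we use a normalized pixel difference purely to illustrate that
    the edit, not either image alone, is what gets embedded.
    """
    e = (edited - src).reshape(-1).astype(np.float64)
    return e / (np.linalg.norm(e) + 1e-8)

def edit_score(query_pair, reference_embedding):
    """Score a candidate edit against a reference edit embedding."""
    return cosine(edit_embedding(*query_pair), reference_embedding)

rng = np.random.default_rng(0)
src = rng.random((4, 4))
edited = src + 0.5                      # a uniform "brighten" edit
ref = edit_embedding(src, edited)       # reference exemplar embedding

# The same transformation applied to a *different* image should score high,
# because the embedding captures the edit rather than the image content.
src2 = rng.random((4, 4))
print(round(edit_score((src2, src2 + 0.5), ref), 3))  # → 1.0
```

The same comparison against a text-instruction embedding (rather than another pair) is what yields the text-conditioned variant of the metric, since CLIP-style training places both modalities in a shared space.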
Problem

Research questions and friction points this paper is trying to address.

Learning unified representations for image editing transformations
Enhancing exemplar-based image editing with reference embeddings
Automating edit evaluation using similarity metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly encodes image and edited counterpart
Replaces text instructions with exemplar embeddings
Measures edit similarity via embedding comparisons
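The second bullet — swapping text instructions for exemplar embeddings — amounts to changing what conditioning vector the diffusion denoiser receives. The sketch below is a deliberately minimal illustration of that interface only: `denoise_step` is a hypothetical one-line stand-in for a guided denoising step, not InstructPix2Pix's actual U-Net, and the pair encoder is the same toy pixel-difference stand-in as above.

```python
import numpy as np

def edit_embedding(src, edited):
    # Toy stand-in for the EditCLIP pair encoder (see hedge in lead-in).
    e = (edited - src).reshape(-1).astype(np.float64)
    return e / (np.linalg.norm(e) + 1e-8)

def denoise_step(latent, conditioning, guidance_scale=1.5):
    """Hypothetical guided denoising step: nudge the latent toward the
    conditioning direction. An InstructPix2Pix-style model would normally
    receive a text-instruction embedding here; EditCLIP passes the
    embedding of a reference (before, after) exemplar pair instead."""
    return latent + guidance_scale * 0.1 * conditioning

# Reference exemplar pair demonstrating the desired edit.
ref_src = np.zeros((2, 2))
ref_edit = np.full((2, 2), 0.5)
cond = edit_embedding(ref_src, ref_edit)   # replaces the text embedding

latent = np.zeros(4)
for _ in range(10):
    latent = denoise_step(latent, cond)
print(latent.round(2))  # → [0.75 0.75 0.75 0.75]
```

The point of the sketch is that nothing downstream of the conditioning input needs to change: the denoiser consumes an embedding either way, so the exemplar pair is a drop-in replacement for the instruction text.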