🤖 AI Summary
This work addresses two key challenges in image editing: insufficient modeling of editing transformations and the weak correlation between automated evaluation metrics and human judgments. To this end, the authors propose EditCLIP, a representation-learning approach that encodes the edit itself. Methodologically, they design a dual-branch contrastive encoder that jointly encodes an original image and its edited counterpart. Building on the CLIP architecture, they introduce an editing-aware alignment loss that places image-pair embeddings and cross-modal (textual) representations in a unified semantic space. The resulting embedding space directly supports exemplar-based editing and text-free automatic assessment of editing quality. Experiments demonstrate that the approach surpasses state-of-the-art methods on exemplar-based editing tasks; moreover, its automatic evaluation metric achieves significantly higher correlation with human ratings while improving inference efficiency by 40%.
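The dual-branch idea above can be illustrated with a minimal sketch: a shared backbone encodes both images, and the pair features are fused into a single normalized "edit" embedding. This is not the authors' code; the toy MLP backbone, layer sizes, and `EditPairEncoder`/`fuse` names are illustrative stand-ins for a CLIP-style vision encoder.

```python
# Minimal sketch of a dual-branch edit encoder (illustrative, not the paper's model).
import torch
import torch.nn as nn

class EditPairEncoder(nn.Module):
    def __init__(self, in_dim=512, embed_dim=128):
        super().__init__()
        # Placeholder for a CLIP-style vision backbone, shared by both branches.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        # Fuses the two branch features into a single edit embedding.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, original, edited):
        f_orig = self.backbone(original)   # features of the input image
        f_edit = self.backbone(edited)     # features of the edited image
        z = self.fuse(torch.cat([f_orig, f_edit], dim=-1))
        return nn.functional.normalize(z, dim=-1)  # unit-norm edit embedding

orig = torch.randn(4, 512)  # stand-in image features for a batch of 4 pairs
edit = torch.randn(4, 512)
z = EditPairEncoder()(orig, edit)
print(z.shape)  # one edit embedding per (original, edited) pair
```

In a real system, a contrastive loss would pull each pair embedding toward the embedding of its editing instruction, which is what makes the space usable for both conditioning and evaluation.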
📝 Abstract
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
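The evaluation protocol described above reduces to a similarity comparison in embedding space. The sketch below shows the scoring step only, with random vectors standing in for the embeddings an EditCLIP-style model would actually produce from the image pair and the text instruction; all names here are hypothetical.

```python
# Cosine-similarity scoring of an edit, with stand-in embeddings
# (a trained model would supply z_pair and z_text).
import numpy as np

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
z_pair = rng.normal(size=128)                  # embedding of (input, edited) pair
z_text = z_pair + 0.1 * rng.normal(size=128)   # instruction embedding, near the pair
z_other = rng.normal(size=128)                 # embedding of an unrelated edit

score_match = cosine(z_pair, z_text)
score_mismatch = cosine(z_pair, z_other)
print(score_match, score_mismatch)
```

A well-trained embedding space assigns the matching instruction a higher score than an unrelated one, which is the basis for ranking edits against human judgments.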