🤖 AI Summary
This work addresses a limitation of existing exemplar-based image editing methods: they rely on pair-of-pairs supervision, which requires two image pairs sharing the same edit semantics, and therefore scale and generalize poorly. The authors propose Delta-Adapter, which learns transferable editing semantics from a single source-target image pair, with no textual guidance. It uses a pretrained vision encoder to extract a semantic delta (the visual transformation between the source and target images) and injects it into a pretrained image editing model through a lightweight Perceiver-based adapter. Because the model never sees the target image directly, that image can serve as the prediction target during training, which allows existing large-scale editing datasets to be used. A semantic delta consistency loss aligns the semantic change of the generated output with the delta extracted from the exemplar pair. In experiments, Delta-Adapter improves both editing accuracy and content consistency over four strong baselines on seen editing tasks and generalizes more effectively to unseen ones.
📝 Abstract
Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, requiring two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision, requiring no textual guidance. Rather than directly exposing the exemplar pair to the model, we leverage a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Since the target image is never directly visible to the model, it can serve as the prediction target, enabling single-pair supervision without requiring additional exemplar pairs. This formulation allows us to leverage existing large-scale editing datasets for training. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at https://delta-adapter.github.io.
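The abstract does not spell out how the semantic delta or the consistency loss is computed, so the following is only a minimal sketch under stated assumptions: the delta is taken to be the difference between two encoder embeddings, and the loss is taken to be one minus the cosine similarity between the exemplar delta and the delta of the generated output. The function names `semantic_delta` and `delta_consistency_loss` are hypothetical, and plain NumPy vectors stand in for the outputs of the pretrained vision encoder.

```python
import numpy as np


def semantic_delta(emb_source, emb_target):
    """Assumed form of the semantic delta: target embedding minus source embedding."""
    return np.asarray(emb_target, dtype=float) - np.asarray(emb_source, dtype=float)


def delta_consistency_loss(delta_exemplar, delta_generated, eps=1e-8):
    """Assumed form of the loss: 1 - cosine similarity between the two deltas.

    Close to 0 when the generated change points the same way as the exemplar
    change, and grows toward 2 as the two changes point in opposite directions.
    """
    a = np.asarray(delta_exemplar, dtype=float).ravel()
    b = np.asarray(delta_generated, dtype=float).ravel()
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos


# Toy embeddings standing in for vision-encoder outputs (illustrative values only).
emb_src, emb_tgt = np.array([1.0, 0.0, 2.0]), np.array([1.0, 1.0, 2.0])
emb_query, emb_output = np.array([0.5, 0.0, 0.0]), np.array([0.5, 2.0, 0.0])

exemplar_delta = semantic_delta(emb_src, emb_tgt)         # change shown by the exemplar pair
generated_delta = semantic_delta(emb_query, emb_output)   # change applied to the query
loss = delta_consistency_loss(exemplar_delta, generated_delta)
```

In this toy case both deltas point along the same axis, so the loss is close to zero. A cosine-based form ignores the magnitude of the change; whether the paper uses this, an L2 distance, or another form is not stated in the abstract.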