🤖 AI Summary
To address the ambiguity and limited expressiveness of textual descriptions in image editing, this paper proposes an optimization-free, end-to-end, exemplar-based image editing method. The approach leverages pretrained text-to-image diffusion models and multimodal vision-language models (VLMs) to infer editing intent directly from an input–output image pair and transfer it to target images. The key contribution is the elimination of conventional fine-tuning and iterative optimization: instead, a VLM explicitly models the cross-image editing relationship, enabling plug-and-play intent transfer. Extensive experiments across diverse editing tasks, including attribute replacement, style transfer, and structural deformation, show that the method outperforms existing baselines while achieving roughly 4× faster inference, with strong generalization and high visual fidelity. This work points toward more efficient, intuitive, and semantically precise image editing.
📝 Abstract
Text-to-image diffusion models have enabled a wide array of image editing applications. However, capturing all types of edits through text alone can be challenging and cumbersome. The ambiguous nature of certain image edits is better expressed through an exemplar pair, i.e., a pair of images showing a scene before and after an edit, respectively. In this work, we tackle exemplar-based image editing -- the task of transferring an edit from an exemplar pair to one or more content images -- by leveraging pretrained text-to-image diffusion models and multimodal VLMs. Even though our end-to-end pipeline is optimization-free, our experiments demonstrate that it still outperforms baselines on multiple types of edits while being ~4x faster.
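The pipeline described above can be sketched in two stages: a VLM compares the exemplar pair and expresses the edit as an instruction, and a diffusion model applies that instruction to the content image with no fine-tuning or per-image optimization. The sketch below is purely illustrative: all function names are hypothetical, and sets of string tags stand in for the VLM and diffusion components, which the paper does not specify at this level of detail.

```python
# Illustrative sketch of an optimization-free exemplar-editing pipeline.
# Images are modeled as sets of attribute tags; in a real system the two
# functions below would call a multimodal VLM and a pretrained
# text-to-image diffusion model, respectively (hypothetical stand-ins).

def infer_edit_instruction(before_image, after_image):
    """VLM stand-in: compare the exemplar pair and describe the edit."""
    return {
        "remove": before_image - after_image,  # attributes the edit removed
        "add": after_image - before_image,     # attributes the edit added
    }

def apply_edit(content_image, instruction):
    """Diffusion stand-in: transfer the inferred edit to a new image,
    with no fine-tuning or iterative optimization."""
    return (content_image - instruction["remove"]) | instruction["add"]

# Exemplar pair: a scene edited from "day" to "night".
before = {"house", "day"}
after = {"house", "night"}

instruction = infer_edit_instruction(before, after)
edited = apply_edit({"car", "day"}, instruction)
print(sorted(edited))  # ['car', 'night']
```

The point of the sketch is the dataflow, not the representation: the edit is inferred once from the exemplar pair and then reused plug-and-play across content images, which is what makes the pipeline optimization-free.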