🤖 AI Summary
Existing instruction-guided image editing methods rely on task-specific annotations, segmentation masks, or model fine-tuning, which limits their generalizability and deployment efficiency. This paper introduces the first fully unsupervised, language-guided image editing framework: it requires no annotations, masks, or model adaptation, enabling zero-shot, zero-training, zero-mask "plug-and-play" editing. The approach leverages pre-trained multimodal models (CLIP and a diffusion prior) and uses gradient-driven latent-space inversion to discover semantically consistent editing trajectories under text-image alignment constraints. Evaluated across multiple benchmarks, the method achieves state-of-the-art performance, and qualitative and quantitative analyses show superior editing accuracy, image fidelity, and output diversity compared to supervised alternatives.
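To make the core idea concrete, below is a minimal sketch of what gradient-driven latent-space editing under a CLIP text-image alignment constraint could look like, assuming the method amounts to optimizing a latent code against a frozen decoder and frozen CLIP encoders. All module names, loss weights, and hyperparameters here are illustrative placeholders, not the authors' implementation; the stub modules stand in for a pre-trained diffusion prior/decoder and a pre-trained CLIP model.

```python
# Hedged sketch: CLIP-guided gradient optimization of an image latent.
# Not the paper's code; the decoder and encoder below are stand-in modules.
import torch
import torch.nn.functional as F

class StubDecoder(torch.nn.Module):
    """Placeholder for a frozen, pre-trained image decoder / diffusion prior."""
    def __init__(self, latent_dim=64, image_dim=3 * 32 * 32):
        super().__init__()
        self.net = torch.nn.Linear(latent_dim, image_dim)

    def forward(self, z):
        return self.net(z)

class StubClipImageEncoder(torch.nn.Module):
    """Placeholder for a frozen, pre-trained CLIP image encoder."""
    def __init__(self, image_dim=3 * 32 * 32, embed_dim=128):
        super().__init__()
        self.net = torch.nn.Linear(image_dim, embed_dim)

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def edit_latent(z_init, text_embed, decoder, clip_image_enc,
                identity_weight=0.5, steps=200, lr=1e-2):
    """Gradient-driven latent search: move the latent so the decoded image
    aligns with the instruction's CLIP text embedding, while staying close
    to the source latent. No model weights are updated."""
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = decoder(z)
        image_embed = clip_image_enc(image)
        # Maximize text-image alignment (cosine similarity)...
        align_loss = 1.0 - (image_embed * text_embed).sum(dim=-1).mean()
        # ...while regularizing toward the source latent to preserve content.
        identity_loss = F.mse_loss(z, z_init)
        loss = align_loss + identity_weight * identity_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()

if __name__ == "__main__":
    decoder, clip_enc = StubDecoder(), StubClipImageEncoder()
    z0 = torch.randn(1, 64)                          # latent of the source image
    text = F.normalize(torch.randn(1, 128), dim=-1)  # CLIP embedding of the instruction
    z_edited = edit_latent(z0, text, decoder, clip_enc)
```

Because only the latent is optimized and every model stays frozen, this kind of procedure needs no task-specific labels, masks, or fine-tuning, which is what makes a "plug-and-play" deployment plausible.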
📝 Abstract
Instruction-guided image editing takes an image and an instruction and delivers that image altered according to the instruction. State-of-the-art approaches to this task suffer from the scaling and domain-adaptation hindrances typical of supervision, as they eventually resort to some form of task-specific labelling, masking, or training. We propose a novel approach that dispenses with any such task-specific supervision and thus offers better potential for improvement. Its assessment demonstrates that it is highly effective, achieving very competitive performance.