SINE: SINgle Image Editing with Text-to-Image Diffusion Models

📅 2022-12-08
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 145
Influential: 13
🤖 AI Summary
This work addresses fine-grained, text-guided editing of a single real-world image using pretrained diffusion models, a setting challenged by overfitting, semantic distortion, and resolution constraints during single-image fine-tuning. To tackle these issues, the authors propose a model-based, classifier-free guidance mechanism that enhances controllability of linguistic instructions throughout the generative process. They also introduce a patch-based fine-tuning strategy that enables efficient knowledge distillation from a single image and supports arbitrary-resolution editing. The approach preserves the original image's structural integrity and semantic consistency while significantly improving editing fidelity and quality across diverse tasks, including style transfer, content addition, and object manipulation. Extensive experiments demonstrate superior performance over existing methods in both visual quality and instruction adherence. The source code is publicly available.
📝 Abstract
Recent works on diffusion models have demonstrated a strong capability for conditional image generation, e.g., text-guided image synthesis. Such success inspires many efforts to use large-scale pre-trained diffusion models to tackle a challenging problem: real image editing. Works conducted in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Using existing approaches to fine-tune a pre-trained diffusion model on a single image causes severe overfitting. Information leakage from the pre-trained diffusion model then prevents the edited result from keeping the same content as the given image while creating the new features described by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon classifier-free guidance, so that the knowledge from a model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that effectively helps the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. Our code is made publicly available here.
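The abstract's core idea, distilling the single-image model into the pre-trained model via guidance, can be illustrated with a toy sketch. This is not the paper's exact formulation: the interpolation weights `w` and `v` and the blending scheme below are hypothetical, chosen only to show how a fine-tuned denoiser's noise prediction might be mixed with the pre-trained model's text-conditioned prediction inside a standard classifier-free-guidance step.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: push the noise prediction
    toward the text condition with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def model_based_guidance(eps_pretrained_cond, eps_finetuned, eps_uncond, w, v):
    """Hypothetical model-based guidance sketch: blend the pre-trained
    model's text-conditioned score with the single-image fine-tuned
    model's score (mixing weight v), then apply classifier-free
    guidance against the unconditional score."""
    eps_mix = v * eps_finetuned + (1.0 - v) * eps_pretrained_cond
    return eps_uncond + w * (eps_mix - eps_uncond)

# With v = 0 the fine-tuned model is ignored and this reduces to
# plain classifier-free guidance on the pre-trained model.
eps_cond = np.ones(4)
eps_uncond = np.zeros(4)
eps_finetuned = np.full(4, 2.0)
guided = model_based_guidance(eps_cond, eps_finetuned, eps_uncond, w=7.5, v=0.0)
```

Varying `v` over the sampling trajectory (more weight on the single-image model early, more on the pre-trained model late, or vice versa) is one natural way such a scheme could trade content preservation against new, text-driven features.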
Problem

Research questions and friction points this paper is trying to address.

Editing single images with diffusion models without overfitting
Maintaining original content while adding new language-guided features
Enabling high-resolution image generation from single input images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-based guidance for single-image editing
Patch-based fine-tuning for arbitrary resolution
Distilling knowledge into pre-trained diffusion models
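The patch-based fine-tuning idea above can be sketched with a minimal crop-sampling routine. The details here are assumptions, not the paper's implementation: a common way to make patch training resolution-agnostic is to pair each random crop with its normalized coordinates, so the model can later be queried at positions covering an arbitrarily large canvas.

```python
import numpy as np

def sample_training_patch(image, patch_size, rng):
    """Crop a random square patch from a single training image and
    return it with its normalized (top, left, bottom, right)
    coordinates in [0, 1], which could serve as positional
    conditioning during fine-tuning."""
    h, w, _ = image.shape
    top = int(rng.integers(0, h - patch_size + 1))
    left = int(rng.integers(0, w - patch_size + 1))
    patch = image[top:top + patch_size, left:left + patch_size]
    coords = np.array([
        top / h,
        left / w,
        (top + patch_size) / h,
        (left + patch_size) / w,
    ])
    return patch, coords

# Usage: repeatedly sample patches from the one available image.
rng = np.random.default_rng(0)
image = np.zeros((64, 48, 3))
patch, coords = sample_training_patch(image, 16, rng)
```

Because every patch carries its own coordinates, denoising at test time can be run patch-by-patch over a grid whose extent exceeds the training resolution, which is one plausible route to the arbitrary-resolution editing the paper claims.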