SINE: SINgle Image Editing with Text-to-Image Diffusion Models

📅 2022-12-08
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 145
Influential: 13
🤖 AI Summary
This work addresses fine-grained, text-guided editing of a single real-world image using pretrained diffusion models, a setting challenged by overfitting, semantic distortion, and resolution constraints during single-image fine-tuning. To tackle these issues, the authors propose a model-based, classifier-free guidance mechanism that enhances controllability of linguistic instructions throughout the generative process. They also introduce a patch-based fine-tuning strategy that enables efficient knowledge distillation from a single image and supports arbitrary-resolution editing. The approach preserves the original image's structural integrity and semantic consistency while significantly improving editing fidelity and quality across diverse tasks, including style transfer, content addition, and object manipulation. Extensive experiments demonstrate superior performance over existing methods in both visual quality and instruction adherence. The source code is publicly available.
📝 Abstract
Recent works on diffusion models have demonstrated a strong capability for conditional image generation, e.g., text-guided image synthesis. Such success inspires many efforts to use large-scale pre-trained diffusion models to tackle a challenging problem: real image editing. Works conducted in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Using existing approaches to fine-tune a pre-trained diffusion model on a single image causes severe overfitting. Information leakage from the pre-trained diffusion model then prevents the edited result from keeping the same content as the given image while creating the new features described by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon classifier-free guidance, so that the knowledge from a model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that effectively helps the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. Our code is made publicly available here.
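The abstract's core idea, distilling the single-image model into the pre-trained model via guidance, can be illustrated with a toy sketch. This is not the paper's exact formulation: the interpolation weights `w` and `v` and the blending scheme below are hypothetical, chosen only to show how a fine-tuned denoiser's noise prediction might be mixed with the pre-trained model's text-conditioned prediction inside a standard classifier-free-guidance step.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: push the noise prediction
    toward the text condition with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def model_based_guidance(eps_pretrained_cond, eps_finetuned, eps_uncond, w, v):
    """Hypothetical model-based guidance sketch: blend the pre-trained
    model's text-conditioned score with the single-image fine-tuned
    model's score (mixing weight v), then apply classifier-free
    guidance against the unconditional score."""
    eps_mix = v * eps_finetuned + (1.0 - v) * eps_pretrained_cond
    return eps_uncond + w * (eps_mix - eps_uncond)

# With v = 0 the fine-tuned model is ignored and this reduces to
# plain classifier-free guidance on the pre-trained model.
eps_cond = np.ones(4)
eps_uncond = np.zeros(4)
eps_finetuned = np.full(4, 2.0)
guided = model_based_guidance(eps_cond, eps_finetuned, eps_uncond, w=7.5, v=0.0)
```

Varying `v` over the sampling trajectory (more weight on the single-image model early, more on the pre-trained model late, or vice versa) is one natural way such a scheme could trade content preservation against new, text-driven features.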
Problem

Research questions and friction points this paper is trying to address.

Editing single images with diffusion models without overfitting
Maintaining original content while adding new language-guided features
Enabling high-resolution image generation from single input images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-based guidance for single-image editing
Patch-based fine-tuning for arbitrary resolution
Distilling knowledge into pre-trained diffusion models
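The patch-based fine-tuning idea above can be sketched with a minimal crop-sampling routine. The details here are assumptions, not the paper's implementation: a common way to make patch training resolution-agnostic is to pair each random crop with its normalized coordinates, so the model can later be queried at positions covering an arbitrarily large canvas.

```python
import numpy as np

def sample_training_patch(image, patch_size, rng):
    """Crop a random square patch from a single training image and
    return it with its normalized (top, left, bottom, right)
    coordinates in [0, 1], which could serve as positional
    conditioning during fine-tuning."""
    h, w, _ = image.shape
    top = int(rng.integers(0, h - patch_size + 1))
    left = int(rng.integers(0, w - patch_size + 1))
    patch = image[top:top + patch_size, left:left + patch_size]
    coords = np.array([
        top / h,
        left / w,
        (top + patch_size) / h,
        (left + patch_size) / w,
    ])
    return patch, coords

# Usage: repeatedly sample patches from the one available image.
rng = np.random.default_rng(0)
image = np.zeros((64, 48, 3))
patch, coords = sample_training_patch(image, 16, rng)
```

Because every patch carries its own coordinates, denoising at test time can be run patch-by-patch over a grid whose extent exceeds the training resolution, which is one plausible route to the arbitrary-resolution editing the paper claims.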