Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models

📅 2025-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Pre-trained text-guided diffusion models struggle to preserve object identity and integrate the object naturally into the scene at the same time during personalized image editing. Method: We propose a training-free latent-space optimization framework that leverages a pre-trained text-to-image diffusion model as an energy function. Our approach employs dual-granularity guidance: coarse-grained text energy enforces global semantic consistency, while fine-grained point-wise image feature energy ensures local structural alignment. Additionally, we introduce a latent-space content composition strategy to enhance target object identity preservation. Results: Extensive experiments demonstrate that our method significantly outperforms existing approaches on object replacement tasks with a large domain gap. It achieves superior identity fidelity while enabling more natural scene integration, supporting high-quality, zero-shot personalized editing in complex scenes.

📝 Abstract

The rapid advancement of pretrained text-driven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing increases, new challenges emerge, especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior and therefore struggle to achieve seamless integration. The commonly used inversion noise initialization also hinders identity consistency with the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as energy function guidance conditioned on reference text-image pairs. We propose a coarse-to-fine strategy that employs text energy guidance at the early stage to achieve a natural transition toward the target class, then uses point-to-point feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce latent space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels in object replacement even with a large domain gap, highlighting its potential for high-quality, personalized image editing.
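The coarse-to-fine schedule described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the two energies below are simple quadratics standing in for the diffusion-model energies, and every name (`text_energy_grad`, `image_energy_grad`, `optimize`, `switch`, `target_mean`, `matches`) is a hypothetical chosen for illustration. It only shows the control flow of optimizing a latent with a coarse energy first and a point-to-point energy afterwards.

```python
import numpy as np

def text_energy_grad(z, mask, target_mean):
    """Toy coarse energy (assumption): pull the mean latent of the
    masked region toward a target vector standing in for the text class."""
    n = mask.sum()
    region_mean = (z * mask[..., None]).sum(axis=(0, 1)) / n
    diff = region_mean - target_mean                  # shape (C,)
    energy = 0.5 * n * float(diff @ diff)
    grad = mask[..., None] * diff                     # identical shift per masked pixel
    return energy, grad

def image_energy_grad(z, ref, matches):
    """Toy fine energy (assumption): squared distance between each edited
    latent point and its matched point in the reference latent."""
    energy = 0.0
    grad = np.zeros_like(z)
    for (py, px), (qy, qx) in matches:
        diff = z[py, px] - ref[qy, qx]
        energy += 0.5 * float(diff @ diff)
        grad[py, px] += diff
    return energy, grad

def optimize(z_init, mask, target_mean, ref, matches,
             steps=40, switch=15, lr=0.5):
    """Gradient descent on the latent: text energy for the first `switch`
    steps, point-to-point image energy for the remaining steps."""
    z = z_init.copy()
    history = []
    for step in range(steps):
        if step < switch:
            energy, grad = text_energy_grad(z, mask, target_mean)
        else:
            energy, grad = image_energy_grad(z, ref, matches)
        z -= lr * grad
        history.append(energy)
    return z, history
```

In this toy setting the latent is an `(H, W, C)` array and `mask` is a float array of zeros and ones marking the edit region. In a real system the gradients would come from differentiating through a pretrained diffusion model rather than from closed-form quadratics, and the point correspondences would have to be estimated from features.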

Problem

Research questions and friction points this paper is trying to address.

Seamless integration in personalized image editing

Identity consistency in complex scene editing

Optimization of edited images in latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for image editing

Coarse-to-fine strategy with energy guidance

Latent space content composition for consistency
