AI Summary
In text-based image editing (TIE), fidelity and editability are inherently conflicting objectives, and existing methods often suffer from over-editing or under-editing because they lack a unified optimization mechanism. This paper proposes UnifyEdit, a tuning-free framework that jointly optimizes structural fidelity and text alignment in the latent space of a pre-trained diffusion model. Its core innovation is an adaptive time-step scheduler that dynamically modulates the relative weights of a self-attention constraint (for preserving structural consistency) and a cross-attention constraint (for enforcing semantic alignment), thereby mitigating gradient conflicts between these competing objectives. Extensive quantitative and qualitative experiments demonstrate that UnifyEdit outperforms state-of-the-art methods across diverse editing tasks, achieving a more robust and balanced trade-off between fidelity and editability.
Abstract
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
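The unified optimization described above can be sketched as a toy latent-update loop. This is an illustrative reconstruction, not the authors' implementation: the specific loss forms (`sa_preservation_loss`, `ca_alignment_loss`), the linear schedule in `adaptive_weights`, and the plain SGD update are all assumptions made for clarity; in the real method the attention maps would be extracted from the pre-trained T2I denoising U-Net during sampling.

```python
import torch

def sa_preservation_loss(sa_src: torch.Tensor, sa_tgt: torch.Tensor) -> torch.Tensor:
    # Assumed form: L2 distance between source and target self-attention maps,
    # penalizing structural drift from the source image (fidelity term).
    return torch.mean((sa_src - sa_tgt) ** 2)

def ca_alignment_loss(ca_tgt: torch.Tensor) -> torch.Tensor:
    # Assumed form: reward a high cross-attention response for the edit tokens,
    # so the change described by the prompt actually takes effect (editability term).
    return -torch.mean(ca_tgt)

def adaptive_weights(t: int, T: int) -> tuple[float, float]:
    # Hypothetical linear schedule standing in for the paper's adaptive
    # time-step scheduler: the structure constraint dominates at noisy
    # (large-t) steps, and text alignment takes over as denoising proceeds.
    r = t / T
    return r, 1.0 - r  # (lambda_sa, lambda_ca)

def optimize_latent(z_t, sa_src, sa_of, ca_of, t, T, lr=0.1, steps=5):
    # One round of diffusion latent optimization at timestep t: nudge the
    # latent so the weighted SA + CA objective decreases. `sa_of` / `ca_of`
    # are toy callables standing in for attention maps from the U-Net.
    z = z_t.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    lam_sa, lam_ca = adaptive_weights(t, T)
    for _ in range(steps):
        loss = (lam_sa * sa_preservation_loss(sa_src, sa_of(z))
                + lam_ca * ca_alignment_loss(ca_of(z)))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

Optimizing the latent itself, rather than injecting attention maps directly, is what lets a single objective trade the two constraints off against each other; the scheduler then decides, per timestep, which gradient is allowed to dominate.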