🤖 AI Summary
Existing diffusion-based editing methods struggle to preserve global structure while making precise local modifications, primarily because inversion yields poor-quality initial latents. This work proposes ResetEdit, a proactive diffusion editing framework that embeds recoverable latent information directly into the generation process: by injecting the difference between the clean and diffused latents along the diffusion trajectory and extracting this signal during inversion, the method reconstructs a resettable starting point that closely approximates the original generative latent. A lightweight latent optimization module further corrects the reconstruction bias caused by VAE asymmetry. The approach is presented as the first to integrate a resettable latent mechanism into generation itself, so the editing starting point can be reconstructed with high fidelity without storing the original latent. Evaluated on Stable Diffusion, it achieves superior structural consistency and detail fidelity across diverse text-guided editing tasks and outperforms state-of-the-art methods, while remaining efficient, training-free, and compatible with existing tuning-free editing techniques.
📝 Abstract
Recent advances in diffusion models have enabled high-quality image generation, leading to increasing demand for post-generation editing that modifies local regions while preserving global structure. Achieving such flexible and precise editing requires a high-quality starting point, a latent representation that provides both the freedom needed for diverse modifications and the precision required for fine-grained, region-specific control. However, existing inversion-based approaches such as DDIM inversion often yield unsatisfactory starting latents, resulting in degraded edit fidelity and structural inconsistency. Ideally, the most suitable editing anchor should be the original latent used during the generation process, as it inherently captures the scene's structure and semantics. Yet, storing this latent for every generated image is impractical due to massive storage and retrieval costs. To address this challenge, we propose ResetEdit, a proactive diffusion editing framework that embeds recoverable latent information directly into the generation process. By injecting the discrepancy between the clean and diffused latents into the diffusion trajectory and extracting it during inversion, ResetEdit reconstructs a resettable latent that closely approximates the true starting state. Additionally, a lightweight latent optimization module compensates for reconstruction bias caused by VAE asymmetry. Built upon Stable Diffusion, ResetEdit integrates seamlessly with existing tuning-free editing methods and consistently outperforms state-of-the-art baselines in both controllability and visual fidelity.
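To make the baseline problem concrete, the following is a minimal sketch of deterministic DDIM sampling and DDIM inversion on a toy latent. It is not ResetEdit's method and does not reproduce its injection or extraction step; it only illustrates why a plain inversion may fail to return the latent that generation started from. The noise predictor `toy_eps`, the schedule `alphas`, and the latent shape are all assumptions standing in for a real denoising network and scheduler.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def sample(x_T, eps_model, alphas):
    """Generation: walk from the noisiest level (alphas[-1]) to the cleanest (alphas[0])."""
    x = x_T
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i - 1])
    return x

def invert(x_0, eps_model, alphas):
    """DDIM inversion: walk back up the schedule.

    The noise is predicted at the current state rather than at the (unknown)
    next state, which is the approximation that makes inversion inexact.
    """
    x = x_0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i + 1])
    return x

def toy_eps(x, i):
    # Hypothetical stand-in for a learned noise predictor (assumption).
    return np.tanh(x)

# Assumed schedule: cumulative alphas from nearly clean to very noisy.
alphas = np.linspace(0.999, 0.01, 50)

rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 8, 8))      # the "original latent" used for generation
x_0 = sample(x_T, toy_eps, alphas)        # generated clean latent
x_T_rec = invert(x_0, toy_eps, alphas)    # latent recovered by inversion

err = float(np.linalg.norm(x_T_rec - x_T) / np.linalg.norm(x_T))
print(f"relative error of the inverted starting latent: {err:.4f}")
```

In this toy setting the round trip is exact only when the predicted noise does not depend on the state. With a state-dependent predictor, the recovered latent generally differs from the one used for generation, which is the gap the abstract attributes to inversion-based starting points.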