AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional point-driven image editing methods rely on iterative latent optimization or geometric transformations. The former is slow and the latter struggles to model semantic correlations, so neither fully exploits the editing capabilities of pre-trained diffusion models. This paper proposes AttentionDrag, a one-step, point-driven editing framework that requires neither retraining nor extensive re-optimization. It reuses the cross-region semantic correlations that the U-Net self-attention layers encode during DDIM inversion to identify and adjust the relevant image regions, and it adaptively generates masks for precise, context-aware interactive editing. The central idea is to treat diffusion-model self-attention as a semantic prior for point-guided editing. According to the authors, the method surpasses most state-of-the-art approaches in semantic coherence while running significantly faster.

📝 Abstract

Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantically consistent, high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reuse the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with user-friendly interaction. Our results demonstrate performance that surpasses most state-of-the-art methods at significantly faster speeds, offering a more efficient and semantically coherent solution for point-based image editing tasks.
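
For readers unfamiliar with the DDIM inversion process mentioned in the abstract, the standard deterministic inversion step can be written as below. This is the general textbook formulation, not necessarily the exact variant used by AttentionDrag. The symbols are assumptions introduced here: $z_t$ is the latent at step $t$, $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient, and $\epsilon_\theta$ is the U-Net noise prediction.

```latex
\hat{z}_0 = \frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}},
\qquad
z_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\hat{z}_0 + \sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(z_t, t)
```

Each inversion step evaluates the U-Net once, so its self-attention maps are available at that step without extra optimization, which is presumably what makes them cheap to reuse.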

Problem

Research questions and friction points this paper is trying to address.

Exploiting latent knowledge in diffusion models for editing

Improving semantic consistency in point-based image editing

Enhancing efficiency and precision in image manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages latent knowledge in diffusion models

Uses self-attention for semantic consistency

Adaptively generates masks for precise editing
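
To make the idea of attention-derived masks concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`toy_features`, `self_attention_map`, `mask_from_handle_point`), the relative threshold, and the use of raw features as both queries and keys (a real U-Net applies learned projections) are all hypothetical choices made for this example.

```python
import numpy as np


def toy_features(h: int, w: int) -> np.ndarray:
    """Hypothetical stand-in for U-Net features: the left half of the grid
    shares one feature vector and the right half shares an orthogonal one."""
    cols = np.tile(np.arange(w), h)          # column index of each row-major cell
    left = cols < w // 2
    return np.where(left[:, None], [4.0, 0.0], [0.0, 4.0])   # shape (h*w, 2)


def self_attention_map(features: np.ndarray) -> np.ndarray:
    """Softmax(F F^T / sqrt(d)) over flattened spatial positions.
    Assumption: features serve as both queries and keys (no learned projections)."""
    d = features.shape[1]
    logits = features @ features.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def mask_from_handle_point(attn, handle_yx, grid_hw, rel_threshold=0.5):
    """Select the cells whose attention from the handle point is at least
    rel_threshold times the strongest response (hypothetical thresholding rule)."""
    h, w = grid_hw
    y, x = handle_yx
    row = attn[y * w + x].reshape(h, w)      # attention from the handle point to all cells
    return row >= rel_threshold * row.max()


if __name__ == "__main__":
    attn = self_attention_map(toy_features(4, 4))
    mask = mask_from_handle_point(attn, handle_yx=(1, 0), grid_hw=(4, 4))
    print(mask.astype(int))
```

In this toy setup a handle point placed in the left half selects the whole left half, because those cells share the same feature and therefore attend strongly to one another. How AttentionDrag actually aggregates attention layers and timesteps into its mask is not specified in the abstract.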
