ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing drag-based image editing methods struggle to effectively model fine-grained textures and semantic context from reference images, resulting in edited outputs with insufficient coherence and fidelity. This paper proposes a high-precision drag-editing framework that requires neither fine-tuning nor latent-space inversion. Its core innovations are: (1) a context-preserving token injection mechanism that explicitly transfers texture and structural information from the reference image; and (2) a position-consistent attention mechanism that integrates overlap-aware masking with spatial inverse mapping of VAE latent features to ensure precise spatial alignment between edited regions and surrounding context. Evaluated on the DragBench-SR and DragBench-DR benchmarks, our method comprehensively outperforms existing state-of-the-art approaches, achieving significant improvements in visual quality, semantic consistency, and detail fidelity.
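The summary's "spatial inverse mapping of VAE latent features" can be illustrated with a minimal sketch (not the authors' code): user drag points given in pixel space are mapped down to the VAE latent grid, and noise-free reference features are copied from the handle positions to the target positions. The 8x downsampling factor and all function names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch of mapping drag points into the VAE latent grid and
# injecting reference tokens at their destination positions. The 8x factor
# matches common Stable-Diffusion-style VAEs; it is an assumption here.
VAE_DOWNSAMPLE = 8

def to_latent_coords(points_px):
    """Map (x, y) pixel coordinates to integer latent-grid coordinates."""
    pts = np.asarray(points_px, dtype=np.float64)
    return np.floor(pts / VAE_DOWNSAMPLE).astype(int)

def inject_reference_tokens(latent, ref_latent, handles_px, targets_px):
    """Copy noise-free reference features from handle positions to target
    positions in a channels-first (C, H, W) latent — a crude stand-in for
    the paper's Latent-space Reverse Mapping plus token injection."""
    out = latent.copy()
    for (hx, hy), (tx, ty) in zip(to_latent_coords(handles_px),
                                  to_latent_coords(targets_px)):
        out[:, ty, tx] = ref_latent[:, hy, hx]
    return out

# Toy example: a 4-channel, 64x64 latent (i.e., a 512x512 image).
ref = np.random.default_rng(0).normal(size=(4, 64, 64))
lat = np.zeros_like(ref)
edited = inject_reference_tokens(lat, ref,
                                 handles_px=[(40, 40)], targets_px=[(200, 120)])
print(np.allclose(edited[:, 15, 25], ref[:, 5, 5]))  # True
```

The key point the sketch captures is that injection happens in clean (noise-free) latent space, so no inversion step is needed to recover reference features.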

📝 Abstract
Drag-based image editing aims to modify visual content following user-specified drag operations. Although existing methods have made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can exploit rich contextual cues and preserve fine-grained details without the need for fine-tuning or inversion. Specifically, ContextDrag introduces a novel Context-preserving Token Injection (CTI) mechanism that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA) that positionally re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Improving the precision of drag-based image editing
Preserving contextual details and texture fidelity from the reference image
Eliminating interference from irrelevant reference features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting noise-free reference features via CTI and LRM
Applying Position-Consistent Attention with overlap-aware masking
Leveraging VAE-encoded features without finetuning or inversion
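The overlap-aware masking idea behind Position-Consistent Attention can be sketched as follows. This is an illustrative toy, not the paper's implementation: queries from the edited image attend to all image tokens, but reference tokens that do not overlap the relevant region are masked out before the softmax. All names and shapes are hypothetical.

```python
import numpy as np

def overlap_aware_mask(n_img, ref_relevant):
    """Additive mask over keys = [image tokens | reference tokens]: image
    queries may attend to every image token but only to reference tokens
    flagged as relevant; the rest are blocked with -inf."""
    ref_relevant = np.asarray(ref_relevant, dtype=bool)
    mask = np.zeros((n_img, n_img + ref_relevant.size))
    mask[:, n_img:][:, ~ref_relevant] = -np.inf  # block irrelevant refs
    return mask

def masked_attention(q, k, v, mask):
    """Plain scaled dot-product attention with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(1)
q = rng.normal(size=(2, 8))   # 2 image-token queries
k = rng.normal(size=(4, 8))   # keys: 2 image tokens + 2 reference tokens
v = rng.normal(size=(4, 8))
mask = overlap_aware_mask(n_img=2, ref_relevant=[True, False])
out, weights = masked_attention(q, k, v, mask)
print(weights[:, 3])  # attention to the irrelevant reference token: [0. 0.]
```

In the paper, the reference tokens would additionally be positionally re-encoded to their mapped destination coordinates before attention; the mask shown here only captures the interference-suppression half of PCA.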
Huiguo He
South China University of Technology
Pengyu Yan
South China University of Technology
Ziqi Yi
South China University of Technology, Kuaishou Technology
Weizhi Zhong
University of Hong Kong
Text-to-Image Generation
Zheng Liu
Kuaishou Technology
Yejun Tang
Kuaishou Technology
Huan Yang
Kuaishou Technology
Kun Gai
Senior Director & Researcher, Alibaba Group
Machine Learning · Computational Advertising
Guanbin Li
Shenzhen Loop Area Institute
Lianwen Jin
Professor of Electronic and Information Engineering, South China University of Technology
Optical Character Recognition (OCR) · Computer Vision · Document AI · Multimodal LLMs