LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing drag-based image editing methods rely on implicit point matching via attention mechanisms, leading to weakened inversion strength, expensive test-time optimization (TTO), and difficulty in achieving high-fidelity inpainting and text-guided generation at the same time. This paper introduces the first drag-based editing paradigm built on explicit correspondence maps, enabling geometry-aware control in multimodal diffusion Transformers without implicit matching and supporting full-strength inversion, multi-round editing, and synchronized move-and-scale operations. By integrating explicit correspondence mapping with attention-enhanced control, the method eliminates TTO entirely and unifies geometric manipulation with text guidance. On DragBench, it significantly outperforms state-of-the-art methods: drag accuracy and perceptual quality (VIEScore) improve markedly, and human evaluation confirms higher inpainting fidelity and more precise semantic controllability in generation, establishing a new state of the art.

📝 Abstract
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, a first for the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a ``tennis ball'', or, for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves the way for new editing paradigms.
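The abstract's core idea, deriving an explicit correspondence map from user drag inputs rather than matching points implicitly through attention, can be illustrated with a toy sketch. This is a hedged illustration, not the paper's implementation: the function name `correspondence_map`, the nearest-drag-point displacement rule, and the (y, x) grid convention are all assumptions for demonstration.

```python
import numpy as np

def correspondence_map(src_pts, dst_pts, h, w):
    """Build an explicit target->source correspondence map (toy sketch).

    Each cell of an h x w grid is mapped back to a source location by
    borrowing the displacement of the nearest drag point. src_pts and
    dst_pts are (N, 2) sequences of (y, x) drag handles and targets.
    Returns an (h, w, 2) array of source coordinates.
    """
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    disp = src - dst                          # per-drag target -> source shift

    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float64)

    # assign every grid cell the displacement of its nearest drag target
    d2 = ((grid[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    corr = grid + disp[nearest]

    # keep looked-up source coordinates inside the image bounds
    corr[:, 0] = corr[:, 0].clip(0, h - 1)
    corr[:, 1] = corr[:, 1].clip(0, w - 1)
    return corr.reshape(h, w, 2)

# One drag moving the handle at (2, 2) to (2, 5): the cell at the drag
# target looks back to the original handle position.
m = correspondence_map([(2, 2)], [(2, 5)], 8, 8)
print(m[2, 5])  # -> [2. 2.]
```

In the paper, such a map would serve as the reliable reference guiding attention control during full-strength inversion; a real implementation would operate on latent-token grids and handle scaling as well as translation.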
Problem

Research questions and friction points this paper is trying to address.

Implicit point matching via attention is a core bottleneck in drag-based editing
Weakened inversion strength and costly test-time optimization (TTO) limit generative capability
High-fidelity inpainting and text-guided generation are hard to achieve simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit correspondence map from drag inputs as a reliable reference for attention control
Stable full-strength inversion without test-time optimization
Unified precise geometric control and text guidance
Zixin Yin
The Hong Kong University of Science and Technology, StepFun
Xili Dai
UC Berkeley; HKUST
computer vision
Duomin Wang
Senior Researcher, StepFun
computer vision
Xianfang Zeng
StepFun
Lionel M. Ni
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology
Gang Yu
StepFun
Heung-Yeung Shum
Microsoft
Computer Vision, Computer Graphics