🤖 AI Summary
Existing text-driven methods excel at texture editing but offer only coarse spatial control, whereas drag-based approaches achieve precise geometric deformation yet lack texture guidance. This paper introduces the first unified diffusion framework for joint text-and-drag-guided image editing. The method addresses the complementary limitations of these unimodal paradigms through three core contributions: (1) a deterministic point-cloud-based drag mechanism enabling accurate structural manipulation in latent space; (2) a drag-text collaborative denoising strategy that dynamically balances the two conditioning signals via learnable weights; and (3) integrated 3D feature mapping and latent-space modulation to decouple and jointly optimize layout and appearance. Extensive experiments demonstrate high fidelity and strong generalization across diverse editing tasks (object repositioning, shape deformation, and texture refinement), surpassing or matching state-of-the-art unimodal methods in both quantitative metrics (e.g., LPIPS, CLIP-Score) and qualitative evaluation.
📝 Abstract
This paper explores image editing under the joint control of text and drag interactions. While recent advances in text-driven and drag-driven editing have achieved remarkable progress, the two paradigms suffer from complementary limitations: text-driven methods excel at texture manipulation but lack precise spatial control, whereas drag-driven approaches primarily modify shape and structure without fine-grained texture guidance. To address these limitations, we propose a unified diffusion-based framework for joint drag-text image editing that integrates the strengths of both paradigms. Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, which dynamically balances the influence of drag and text conditions during denoising. Notably, our model supports flexible editing modes (text-only, drag-only, or combined conditions) while maintaining strong performance in each setting. Extensive quantitative and qualitative experiments demonstrate that our method not only achieves high-fidelity joint editing but also matches or surpasses specialized text-only and drag-only approaches, establishing a versatile and generalizable solution for controllable image manipulation. Code will be made publicly available to reproduce all results presented in this work.
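To make the drag-text guided denoising idea concrete, the following is a minimal sketch of how learnable weights could balance two conditioning signals at each denoising step. It is an illustrative reconstruction, not the paper's implementation: the function name `guided_noise`, the softmax normalization of the weights, and the classifier-free-guidance-style combination are all assumptions, chosen so that the same routine covers the text-only, drag-only, and joint modes mentioned in the abstract.

```python
import numpy as np

def guided_noise(eps_uncond, eps_text=None, eps_drag=None,
                 w_text=0.0, w_drag=0.0, scale=7.5):
    """Combine unconditional, text-conditioned, and drag-conditioned
    noise predictions using (hypothetical) learnable balance weights.

    Either condition may be None, so the same routine handles the
    text-only, drag-only, and joint drag-text editing modes.
    """
    logits, deltas = [], []
    if eps_text is not None:
        logits.append(w_text)
        deltas.append(eps_text - eps_uncond)
    if eps_drag is not None:
        logits.append(w_drag)
        deltas.append(eps_drag - eps_uncond)
    if not deltas:
        # No active condition: fall back to plain unconditional sampling.
        return eps_uncond
    # Softmax over the active conditions so the balance weights
    # always sum to one, regardless of which modes are enabled.
    w = np.exp(np.array(logits) - max(logits))
    w /= w.sum()
    guidance = sum(wi * d for wi, d in zip(w, deltas))
    return eps_uncond + scale * guidance
```

With a single active condition the softmax weight collapses to 1 and the formula reduces to standard classifier-free guidance, which is consistent with the abstract's claim that each unimodal mode keeps strong performance on its own.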