CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing

📅 2024-10-04
📈 Citations: 1
Influential: 0
🤖 AI Summary
In image editing, purely text-based methods yield coarse semantic descriptions, while drag-based approaches suffer from ambiguous spatial localization—compromising both precision and flexibility. To address this, we propose the first dual-signal collaborative editing framework that jointly leverages textual semantic guidance and drag-point localization for fine-grained, unambiguous local editing on diffusion models. Methodologically: (1) we introduce a text–drag cross-modal alignment mechanism using CLIP to jointly model semantic and geometric signals; (2) we design global–local motion supervision with directional constraints to explicitly model drag-point displacement trajectories; and (3) we incorporate a fast point-tracking strategy to accelerate convergence. Extensive experiments across diverse editing tasks demonstrate significant improvements over unimodal baselines: +23.6% in editing accuracy and +19.4% in semantic consistency, as measured by FID and CLIP-Score.

📝 Abstract
Precise and flexible image editing remains a fundamental challenge in computer vision. Depending on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we examine the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. We then introduce a novel global-local motion supervision method that integrates text signals into existing drag-based methods by adapting a pre-trained vision-language model such as CLIP. Furthermore, we address the slow convergence of CLIPDrag by presenting a fast point-tracking method that ensures drag points move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing drag-only and text-only methods.
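The global-local motion supervision described above combines a local drag term (moving handle-point features one step toward their targets) with a global text-alignment term. The paper does not publish code here, so the following is a hypothetical minimal sketch under assumed simplifications: `feat_map` stands in for a diffusion feature map, `text_sim` for a precomputed CLIP image-text similarity, and `lam` is an illustrative weight, not a value from the paper.

```python
import numpy as np

def global_local_loss(feat_map, handles, targets, text_sim, lam=0.1):
    """Hypothetical sketch of global-local motion supervision.

    feat_map : (H, W, C) feature map from the diffusion model
    handles  : (N, 2) integer current handle-point coordinates
    targets  : (N, 2) integer target-point coordinates
    text_sim : scalar CLIP similarity between edited image and prompt
    lam      : weight of the global text term (illustrative value)
    """
    local = 0.0
    for h, t in zip(handles, targets):
        d = t - h
        d = d / (np.linalg.norm(d) + 1e-8)  # unit step toward the target
        h_next = np.clip((h + d).round().astype(int),
                         [0, 0], np.array(feat_map.shape[:2]) - 1)
        # local term: pull the feature at the next position toward
        # the feature currently at the handle
        local += np.linalg.norm(feat_map[h_next[0], h_next[1]]
                                - feat_map[h[0], h[1]])
    # global term: higher text-image similarity lowers the loss
    return local - lam * text_sim
```

In an actual implementation this loss would be backpropagated through the diffusion latents at each optimization step; the sketch only shows how the local drag term and the global CLIP term are combined into one objective.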
Problem

Research questions and friction points this paper is trying to address.

How to combine text and drag signals for precise, ambiguity-free image editing
Ambiguity in drag-based editing and imprecision in text-based editing
Slow convergence of drag-based optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines text and drag signals
Uses global-local motion supervision
Introduces fast point-tracking method
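The fast point-tracking idea listed above can be illustrated with a hedged sketch: after each supervision step, each handle is relocated by nearest-neighbor feature search in a local window, but only candidates whose displacement points toward the target are considered. The function name, window size, and feature shapes are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def fast_point_track(feat_map, ref_feat, handle, target, radius=2):
    """Hypothetical sketch of direction-constrained point tracking.

    Searches a (2*radius+1)^2 window around the current handle for the
    pixel whose feature best matches ref_feat, keeping only candidates
    that move the handle toward the target (positive dot product with
    the drag direction).
    """
    H, W, _ = feat_map.shape
    direction = target - handle
    best, best_dist = handle, np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = handle[0] + dy, handle[1] + dx
            if not (0 <= y < H and 0 <= x < W):
                continue
            # directional constraint: reject moves away from the target
            if (dy * direction[0] + dx * direction[1]) <= 0 and (dy or dx):
                continue
            dist = np.linalg.norm(feat_map[y, x] - ref_feat)
            if dist < best_dist:
                best, best_dist = np.array([y, x]), dist
    return best
```

Restricting the search this way is what speeds up convergence: the handle can never drift backward, so fewer supervision steps are wasted re-correcting its position.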
Ziqi Jiang — Zhejiang University
Zhen Wang — Zhejiang University
Long Chen — The Hong Kong University of Science and Technology