S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

📅 2025-07-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-guided image editing methods based on diffusion models suffer from coarse semantic control, inaccurate spatial localization, loss of subject identity and high-frequency details, and erroneous edits to irrelevant regions caused by semantic entanglement. To address these issues, we propose a semantic-spatial dual-decoupling editing framework: (1) we introduce learnable text token embeddings that explicitly represent subject identity and impose an orthogonality constraint in the textual feature space to decouple identity from attribute semantics; (2) we incorporate object-mask-guided cross-attention to confine edits to the regions of interest. Our approach requires only lightweight adaptation of a pretrained text-to-image diffusion model, with no full-model fine-tuning. Extensive qualitative and quantitative evaluations demonstrate significant improvements over state-of-the-art methods in editing accuracy, identity preservation, and detail fidelity. Moreover, our framework supports complex multi-attribute editing tasks such as makeup transfer.
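The orthogonality constraint can be pictured as a penalty on the cosine alignment between the learned identity token embedding and the embeddings of the attributes to be edited. Below is a minimal PyTorch sketch, assuming a squared-cosine penalty; the function name and exact loss form are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(identity_emb: torch.Tensor,
                       attribute_embs: torch.Tensor) -> torch.Tensor:
    """Penalize alignment between the learnable identity token embedding
    and the (frozen) text embeddings of the attributes to be edited.

    identity_emb:   (d,)   learnable identity token embedding
    attribute_embs: (k, d) embeddings of k attribute tokens
    """
    identity = F.normalize(identity_emb, dim=-1)
    attrs = F.normalize(attribute_embs, dim=-1)
    # Squared cosine similarities; the loss is zero exactly when the
    # identity embedding is orthogonal to every attribute embedding.
    return (attrs @ identity).pow(2).mean()
```

Driving this term toward zero pushes the identity token into a subspace orthogonal to the edited attributes, so changing an attribute in the prompt does not drag identity information along with it.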

📝 Abstract
Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by text, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from the attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity, using the semantically disentangled and spatially focused identity token learned during fine-tuning. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.
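The mask-guided cross-attention described in the abstract can be approximated by zeroing the identity token's attention outside the object mask and renormalizing. A hedged sketch follows, assuming post-softmax attention maps of shape (batch, heads, pixels, tokens); the helper name and the renormalization step are assumptions, not the paper's exact mechanism.

```python
import torch

def mask_guided_cross_attention(attn_probs: torch.Tensor,
                                object_mask: torch.Tensor,
                                identity_token_idx: int) -> torch.Tensor:
    """Restrict the identity token's cross-attention to the object region.

    attn_probs:  (batch, heads, pixels, tokens), post-softmax attention maps
    object_mask: (pixels,), binary mask (1 = inside the object)
    """
    guided = attn_probs.clone()
    # Zero the identity token's attention outside the object mask ...
    guided[..., identity_token_idx] *= object_mask
    # ... then renormalize over tokens so each spatial row sums to one.
    guided = guided / guided.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return guided
```

In practice a hook like this would sit inside the UNet's cross-attention layers, so the identity token can only influence pixels covered by the object mask while background regions stay untouched.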
Problem

Research questions and friction points this paper is trying to address.

Achieving precise semantic and spatial control in text-guided image editing
Preserving identity information and high-frequency details during fine-grained edits
Disentangling entangled concepts so that irrelevant image regions are left unaltered
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes the model with a learnable identity text token
Disentangles identity from edited attributes via an orthogonality constraint in the textual feature space
Guides editing by applying object masks to cross-attention maps (see the combined sketch after this list)
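Taken together, fine-tuning would combine the standard denoising objective with the orthogonality penalty while training only the identity token embedding rather than the full model. A minimal sketch under those assumptions; the loss weight and function name are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def finetune_loss(noise_pred: torch.Tensor,
                  noise_target: torch.Tensor,
                  identity_emb: torch.Tensor,
                  attribute_embs: torch.Tensor,
                  ortho_weight: float = 0.1) -> torch.Tensor:
    """Denoising loss plus the orthogonality penalty on the identity token.

    ortho_weight is a hypothetical hyperparameter; the paper's actual
    weighting and schedule are not specified here.
    """
    # Standard diffusion fine-tuning objective (noise prediction MSE).
    denoise = F.mse_loss(noise_pred, noise_target)
    # Squared-cosine orthogonality penalty, as in the earlier sketch.
    identity = F.normalize(identity_emb, dim=-1)
    attrs = F.normalize(attribute_embs, dim=-1)
    ortho = (attrs @ identity).pow(2).mean()
    return denoise + ortho_weight * ortho
```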
👥 Authors
Xudong Liu
ModiFace Inc.
Zikun Chen
Department of Computer Science, University of Toronto
Ruowei Jiang
Department of Computer Science, University of Toronto
Ziyi Wu
University of Toronto
Deep Learning · Computer Vision · 3D Vision · Robotics
Kejia Yin
Department of Computer Science, University of Toronto
Han Zhao
Department of Computer Science, University of Illinois Urbana-Champaign
Parham Aarabi
Department of Computer Science, University of Toronto
Igor Gilitschenski
Assistant Professor, University of Toronto
Robotics · Machine Learning · Computer Vision