DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

📅 2024-09-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address inaccurate editing-region localization and the loss of garment texture detail in fashion image editing, this paper proposes a multimodal-guided diffusion model, DPDEdit. Methodologically, it (1) introduces Grounded-SAM to localize the editing region precisely from the user's textual description; (2) designs a decoupled cross-attention mechanism, paired with an auxiliary U-Net, for texture injection and refinement; and (3) extends the VITON-HD dataset with texture-rich image-text pairs generated by a multimodal large language model. The method integrates multi-source guidance (textual descriptions, human pose, region masks, and texture images) within a latent diffusion framework, improving local editing accuracy and texture photorealism. Quantitative and qualitative evaluations show that DPDEdit outperforms state-of-the-art methods on both image-fidelity and multimodal-consistency metrics.
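
A minimal sketch of the text-driven localization step described above, assuming the public GroundingDINO and segment-anything packages: GroundingDINO grounds the textual description to a bounding box, and SAM refines the box into a pixel-precise editing mask. Checkpoint paths, thresholds, and the helper function name are illustrative placeholders, not values from the paper.

```python
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

def locate_editing_region(image_path, description,
                          dino_cfg="GroundingDINO_SwinT_OGC.py",    # placeholder path
                          dino_ckpt="groundingdino_swint_ogc.pth",  # placeholder path
                          sam_ckpt="sam_vit_h_4b8939.pth"):         # placeholder path
    # 1) Ground the garment description to boxes (cx, cy, w, h, normalized).
    dino = load_model(dino_cfg, dino_ckpt)
    image_source, image = load_image(image_path)
    boxes, logits, phrases = predict(
        model=dino, image=image, caption=description,
        box_threshold=0.35, text_threshold=0.25,  # assumed thresholds
    )
    # Assumes at least one box was detected for the description.
    h, w = image_source.shape[:2]
    cx, cy, bw, bh = boxes[logits.argmax()].tolist()  # highest-scoring box
    box_xyxy = np.array([(cx - bw / 2) * w, (cy - bh / 2) * h,
                         (cx + bw / 2) * w, (cy + bh / 2) * h])

    # 2) Refine the box into a pixel-precise mask with SAM.
    sam = sam_model_registry["vit_h"](checkpoint=sam_ckpt)
    predictor = SamPredictor(sam)
    predictor.set_image(image_source)
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
    return masks[0]  # boolean H x W editing mask

# mask = locate_editing_region("model.jpg", "the floral print dress")
```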

📝 Abstract
Fashion image editing is a crucial tool for designers to convey their creative ideas by visualizing design concepts interactively. Current fashion image editing techniques, though advanced with multimodal prompts and powerful diffusion models, often struggle to accurately identify editing regions and preserve the desired garment texture detail. To address these challenges, we introduce a new multimodal fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit). DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images. To precisely locate the editing region, we first introduce Grounded-SAM to predict the editing region based on the user's textual description, and then combine it with other conditions to perform local editing. To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism. Specifically, this mechanism employs a decoupled cross-attention layer to integrate textual descriptions and texture images, and incorporates an auxiliary U-Net to preserve the high-frequency details of generated garment texture. Additionally, we extend the VITON-HD dataset using a multimodal large language model to generate paired samples with texture images and textual descriptions. Extensive experiments show that our DPDEdit outperforms state-of-the-art methods in terms of image fidelity and coherence with the given multimodal inputs.
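
To make the texture-injection idea concrete, here is a minimal PyTorch sketch of a decoupled cross-attention layer as the abstract describes it: one branch attends to text tokens, a second branch with its own key/value projections attends to garment-texture tokens, and the two outputs are combined. All dimensions and the blending weight are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, query_dim=320, context_dim=768, heads=8, texture_scale=1.0):
        super().__init__()
        self.heads = heads
        self.texture_scale = texture_scale  # assumed blending weight
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        # Separate key/value projections per condition: this is the
        # "decoupling" -- text and texture tokens never share projections.
        self.to_k_text = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_text = nn.Linear(context_dim, query_dim, bias=False)
        self.to_k_tex = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_tex = nn.Linear(context_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.heads
        # Reshape to (batch, heads, tokens, head_dim) for attention.
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, hidden_states, text_tokens, texture_tokens):
        q = self.to_q(hidden_states)
        text_out = self._attend(q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
        tex_out = self._attend(q, self.to_k_tex(texture_tokens), self.to_v_tex(texture_tokens))
        return self.to_out(text_out + self.texture_scale * tex_out)

# Usage: latent tokens from the U-Net, text-encoder tokens, texture tokens.
layer = DecoupledCrossAttention()
x = torch.randn(2, 64 * 64, 320)   # U-Net spatial tokens
text = torch.randn(2, 77, 768)     # text-encoder output
tex = torch.randn(2, 257, 768)     # texture-image encoder output
print(layer(x, text, tex).shape)   # torch.Size([2, 4096, 320])
```

Keeping separate key/value projections per modality lets the texture branch be trained (and scaled at inference) without disturbing the pretrained text-conditioning path.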
Problem

Research questions and friction points this paper addresses.

Accurately identifying editing regions in fashion images
Preserving garment texture details during image editing
Integrating multimodal inputs for coherent fashion generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Grounded-SAM for precise editing region localization
Implements decoupled cross-attention for multimodal condition integration
Employs an auxiliary U-Net to preserve high-frequency texture details (see the sketch after this list)
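
The summary does not spell out how the auxiliary U-Net is wired into the main denoising U-Net, so the sketch below assumes a ReferenceNet/ControlNet-style design: a second encoder processes the garment-texture latent, and its multi-scale features are added to the main branch through zero-initialized projections. Module names and shapes are toy placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a U-Net encoder returning multi-scale features."""
    def __init__(self, ch=4, dims=(64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList()
        prev = ch
        for d in dims:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1), nn.SiLU()))
            prev = d

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

class TextureInjector(nn.Module):
    """Adds auxiliary texture features into main-branch features."""
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        # Zero-initialized 1x1 convs so injection starts as a no-op,
        # a common trick (e.g., ControlNet) for stable fine-tuning.
        self.proj = nn.ModuleList(nn.Conv2d(d, d, 1) for d in dims)
        for p in self.proj:
            nn.init.zeros_(p.weight)
            nn.init.zeros_(p.bias)

    def forward(self, main_feats, tex_feats):
        return [m + p(t) for m, p, t in zip(main_feats, self.proj, tex_feats)]

main_enc, aux_enc, inject = TinyEncoder(), TinyEncoder(), TextureInjector()
z_noisy = torch.randn(1, 4, 64, 64)  # noisy latent of the fashion image
z_tex = torch.randn(1, 4, 64, 64)    # latent of the garment texture image
fused = inject(main_enc(z_noisy), aux_enc(z_tex))
print([f.shape for f in fused])
```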