HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of a dedicated method for generating photorealistic garment images from flat fashion sketches, a critical gap in apparel production pipelines. It formally introduces the "Flat Sketch to Realistic Garment Image" (FS2RG) generation task, which faces two key challenges: insufficient visual supervision for fabric-level details, and potential semantic conflict between sketch geometry and textual prompts. To tackle these, the authors propose HiGarment, an end-to-end diffusion-based framework featuring a multi-modal semantic enhancement module and a harmonized cross-attention mechanism, enabling dynamic structural-textural alignment and controllable fusion under either image-biased or text-biased conditioning. The architecture integrates a CLIP text encoder, a UNet backbone, and a custom cross-modal alignment module. Evaluated on Multi-modal Detailed Garment, the large-scale multi-modal dataset collected for this work, HiGarment significantly outperforms state-of-the-art methods, and user studies confirm substantial improvements in both material realism and structural fidelity.

📝 Abstract
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic garment images from flat sketches and text
Addressing insufficient visual supervision for fabric details in diffusion models
Resolving conflicts between sketch and text guidance in garment synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal semantic enhancement for fabric representation
Harmonized cross-attention balances sketch and text
Generates sketch-aligned or text-guided garment outputs
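The harmonized cross-attention described above can be pictured as two scaled dot-product attention passes, one over sketch features and one over text features, blended by a balance weight. The sketch below is an illustrative guess at that general shape, not the paper's actual implementation; the function names, the NumPy formulation, and the scalar blend weight `lam` are all assumptions for exposition (in the paper the balance is learned and dynamic).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def harmonized_cross_attention(latent, sketch_feats, text_feats, lam):
    # lam in [0, 1] is a hypothetical balance weight:
    #   lam -> 1.0 gives image-biased (sketch-aligned) output,
    #   lam -> 0.0 gives text-biased output.
    a_img = cross_attention(latent, sketch_feats, sketch_feats)
    a_txt = cross_attention(latent, text_feats, text_feats)
    return lam * a_img + (1.0 - lam) * a_txt
```

In a diffusion UNet the queries would come from the denoising latent, with keys/values from the sketch encoder and the CLIP text encoder respectively; sliding `lam` between 0 and 1 is one way to realize the controllable image-biased vs. text-biased synthesis the summary mentions.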
👥 Authors
Junyi Guo
Cornell University
Jingxuan Zhang
Xi'an Jiaotong Liverpool University
Fangyu Wu
Xi'an Jiaotong Liverpool University
Huanda Lu
NingboTech University
Qiufeng Wang
Xi'an Jiaotong Liverpool University
Wenmian Yang
Specially Appointed Associate Professor, Beijing Normal University at Zhuhai
Data Mining, Machine Learning, Natural Language Processing, Time Series
Eng Gee Lim
Xi'an Jiaotong Liverpool University
Dongming Lu
Zhejiang University