Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

📅 2026-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of simultaneously preserving global structure and local semantic details in fashion image generation by proposing the LOTS framework, which introduces a multi-level local-global joint guidance mechanism. LOTS jointly encodes a global sketch and multiple local text-sketch pairs within a shared latent space and fuses these multimodal conditions via attention mechanisms during the diffusion denoising process. To support this approach, the authors construct Sketchy, the first fashion dataset annotated with multiple text-sketch pairs, encompassing both professional and amateur sketches. Experimental results demonstrate that LOTS significantly outperforms existing methods in both structural fidelity and local semantic accuracy, exhibiting strong generalization across both “in-the-wild” and professionally drawn sketches. The code and dataset are publicly released.

📝 Abstract
Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining the textual and visual modalities requires adhering to the sketch's visual structure while leveraging localized attribute guidance from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in-the-wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvements over the state of the art. The dataset, platform, and code are publicly available.
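The conditioning scheme the abstract describes can be pictured as a cross-attention step in which the denoiser's latent tokens attend to a condition sequence built from one global sketch embedding plus the embeddings of each local text-sketch pair. The sketch below is purely illustrative, not the authors' implementation: the function names, the equal embedding dimensions, and the simple concatenation of global and local tokens are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # queries: (n_q, d), context: (n_c, d); scaled dot-product attention
    scores = queries @ context.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ context

def fuse_conditions(latent_tokens, global_sketch, local_pairs):
    """Illustrative fusion of a global sketch embedding with local
    text-sketch pair embeddings via cross-attention (hypothetical,
    not the LOTS code). All embeddings share one dimension d."""
    # Condition sequence: global sketch token, then each pair's
    # text and sketch tokens in order
    cond = np.stack(
        [global_sketch] + [tok for pair in local_pairs for tok in pair]
    )
    d = latent_tokens.shape[1]
    # Latent tokens attend over the multi-level condition sequence
    return cross_attention(latent_tokens, cond, d)

# Toy usage: 4 latent tokens, 1 global sketch, 3 local pairs, dim 8
rng = np.random.default_rng(0)
d = 8
latents = rng.standard_normal((4, d))
g = rng.standard_normal(d)
pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(3)]
fused = fuse_conditions(latents, g, pairs)  # shape (4, 8)
```

In the actual framework this fusion happens repeatedly inside the diffusion model's denoising steps, and the local pairs are first encoded in a shared latent space; the toy version above only shows the attention-based mixing itself.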
Problem

Research questions and friction points this paper is trying to address.

fashion image generation
sketch-text pairing
localized conditioning
multi-level guidance
structure-preserving generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-level conditioning
localized text-sketch pairing
diffusion model guidance
fashion image generation
structured sketch dataset