🤖 AI Summary
Existing fashion image generation methods primarily target narrow tasks such as virtual try-on, failing to capture the dynamic poses, diverse scenes, and narrative expression inherent in professional fashion editorials. This paper introduces “virtual fashion photoshoot”—a novel task that transforms standardized garment images into atmospheric, story-driven, magazine-quality editorial images. Methodologically, the authors propose an automated cross-domain text-image alignment pipeline integrating vision-language reasoning and object-level localization to achieve precise matching between garment images and lookbook-style reference images. The core contribution is the first large-scale Garment–Lookbook Image Pair (GLIP) dataset, comprising 10K high-, 50K medium-, and 300K low-quality aligned pairs, with multi-granularity annotations (e.g., garment attributes, scene context, stylistic cues). This dataset bridges the semantic gap between e-commerce product imagery and fashion media content, advancing fashion image generation from function-oriented applications toward artistic, narrative-driven synthesis.
📝 Abstract
Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photoshoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining vision-language reasoning with object-level localization. The resulting dataset is tiered into three alignment-quality levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.
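The retrieval-and-tiering idea described in the abstract—matching each garment image against object-level crops extracted from lookbook images in a shared embedding space, then bucketing pairs by match confidence—can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the embedding model is left abstract (any vision-language encoder could supply the vectors), and the `0.9`/`0.75` thresholds and the `tier_pairs` helper are hypothetical choices, not values from the paper.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def tier_pairs(garment_embs: np.ndarray, crop_embs: np.ndarray,
               hi: float = 0.9, mid: float = 0.75):
    """Match each garment to its best-scoring lookbook crop and bucket
    the pair into a quality tier by similarity (thresholds are illustrative).

    garment_embs: (G, D) embeddings of catalog garment images.
    crop_embs:    (C, D) embeddings of object-level crops from lookbook images
                  (e.g. garment regions found by a detector).
    Returns a list of (crop_index, score, tier) per garment.
    """
    sims = cosine_sim(garment_embs, crop_embs)
    best = sims.argmax(axis=1)                       # best crop per garment
    scores = sims[np.arange(len(garment_embs)), best]
    tiers = np.where(scores >= hi, "high",
                     np.where(scores >= mid, "medium", "low"))
    return list(zip(best.tolist(), scores.tolist(), tiers.tolist()))
```

In a real pipeline the low tier would simply keep weaker matches rather than discard them, which is one way a dataset can end up with high/medium/low splits of very different sizes, as reported above.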