🤖 AI Summary
This paper addresses the lack of domain-specific fashion priors and poor cross-style generalization in garment generation by proposing VLG, a vision-language-garment multimodal model. Methodologically, VLG adapts web-scale vision-language foundation models to the fashion domain through targeted fine-tuning, combining image-text alignment, cross-modal attention, and garment-structure-aware prior modeling, trained end-to-end on large-scale web-sourced image-text pairs. The contributions are threefold: (1) a unified end-to-end framework for text- and reference-image-guided garment generation; (2) zero-shot cross-style and cross-prompt generation, with preliminary results indicating improved generalization to unseen garment styles and complex textual descriptions; and (3) empirical evidence that general-purpose multimodal foundation models can be transferred efficiently to the vertical domain of fashion design, pointing toward domain-specialized multimodal generative modeling as a broader paradigm.
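To make the component list above concrete, the sketch below shows one way a cross-modal attention head could fuse text and reference-image features from a pretrained vision-language backbone into a garment latent. The paper's actual architecture is not described here, so all module names, dimensions, and the garment-decoder interface are hypothetical placeholders, not VLG's implementation.

```python
# Illustrative sketch only: names, dimensions, and structure are assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class CrossModalGarmentHead(nn.Module):
    """Toy cross-modal attention head over frozen VLM features.

    Assumes a pretrained vision-language backbone already produced image
    tokens and text tokens of width `dim`; only this small head (and perhaps
    the backbone's last layers) would be fine-tuned on fashion data.
    """
    def __init__(self, dim: int = 768, num_heads: int = 8, garment_dim: int = 512):
        super().__init__()
        # Text tokens attend to image tokens (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Hypothetical projection to a garment-structure latent consumed by
        # whatever garment generator/decoder the full system uses.
        self.to_garment_latent = nn.Linear(dim, garment_dim)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, dim), image_tokens: (B, I, dim)
        fused, _ = self.cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
        fused = self.norm(text_tokens + fused)             # residual fusion
        return self.to_garment_latent(fused.mean(dim=1))   # (B, garment_dim)

# Usage with dummy features standing in for a VLM backbone's outputs.
head = CrossModalGarmentHead()
text_feats = torch.randn(2, 16, 768)    # e.g. prompt token features
image_feats = torch.randn(2, 196, 768)  # e.g. reference-image patch features
garment_latent = head(text_feats, image_feats)
print(garment_latent.shape)             # torch.Size([2, 512])
```

In such a setup, the fused latent would condition a downstream garment generator, which is one plausible reading of "garment-structure-aware prior modeling"; the paper itself should be consulted for the actual design.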
📝 Abstract
Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.