🤖 AI Summary
This paper addresses the lack of domain-specific fashion priors and poor cross-style generalization in garment generation by proposing VLG, a vision-language-garment multimodal model. Methodologically, VLG adapts web-scale vision-language foundation models to the fashion domain through targeted fine-tuning, combining image-text alignment, cross-modal attention, and garment-structure-aware prior modeling, trained end-to-end on large-scale web-sourced image-text pairs. The contributions are threefold: (1) a unified end-to-end framework for text- and reference-image-guided garment generation; (2) zero-shot cross-style and cross-prompt generation, with preliminary results indicating improved generalization to unseen garment styles and complex textual descriptions; and (3) empirical evidence that general-purpose multimodal foundation models can be transferred efficiently to the vertical domain of fashion design, pointing toward domain-specialized multimodal generative modeling as a broader paradigm.
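To make the component list above concrete, the sketch below shows one way a cross-modal attention head could fuse text and reference-image features from a pretrained vision-language backbone into a garment latent. The paper's actual architecture is not described here, so all module names, dimensions, and the garment-decoder interface are hypothetical placeholders, not VLG's implementation.

```python
# Illustrative sketch only: names, dimensions, and structure are assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class CrossModalGarmentHead(nn.Module):
    """Toy cross-modal attention head over frozen VLM features.

    Assumes a pretrained vision-language backbone already produced image
    tokens and text tokens of width `dim`; only this small head (and perhaps
    the backbone's last layers) would be fine-tuned on fashion data.
    """
    def __init__(self, dim: int = 768, num_heads: int = 8, garment_dim: int = 512):
        super().__init__()
        # Text tokens attend to image tokens (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Hypothetical projection to a garment-structure latent consumed by
        # whatever garment generator/decoder the full system uses.
        self.to_garment_latent = nn.Linear(dim, garment_dim)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, dim), image_tokens: (B, I, dim)
        fused, _ = self.cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
        fused = self.norm(text_tokens + fused)             # residual fusion
        return self.to_garment_latent(fused.mean(dim=1))   # (B, garment_dim)

# Usage with dummy features standing in for a VLM backbone's outputs.
head = CrossModalGarmentHead()
text_feats = torch.randn(2, 16, 768)    # e.g. prompt token features
image_feats = torch.randn(2, 196, 768)  # e.g. reference-image patch features
garment_latent = head(text_feats, image_feats)
print(garment_latent.shape)             # torch.Size([2, 512])
```

In such a setup, the fused latent would condition a downstream garment generator, which is one plausible reading of "garment-structure-aware prior modeling"; the paper itself should be consulted for the actual design.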
📝 Abstract
Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.