FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model

πŸ“… 2025-04-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address low efficiency and weak interactivity in personalized outfit recommendation for fashion retail, this work proposes the first unified multimodal vision-language model (VLM) architecture tailored for multi-turn dialogue, supporting four core tasks: outfit recommendation, item substitution, garment image generation, and virtual try-on. Methodologically, we introduce FashionRecβ€”a large-scale, context-rich dataset comprising 330K samples spanning basic, personalized, and substitution-based recommendations; design a context-aware, progressive cross-task joint modeling framework; and integrate instruction tuning, diffusion-based image generation, and physics-simulation-driven virtual try-on. Experiments demonstrate significant improvements over state-of-the-art methods across multiple fashion understanding and generation benchmarks. User studies show a 32% increase in recommendation satisfaction and an average interaction naturalness score of 4.6/5, confirming strong practical applicability.

Technology Category

Application Category

πŸ“ Abstract
Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value in the fashion industry. With the advent of vision-language models (VLM), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant, built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3's superior performance in recommendation effectiveness and practical value as a fashion assistant.
Problem

Research questions and friction points this paper is trying to address.

Enhancing fashion retail with multimodal vision-language interactions
Providing personalized and iterative fashion outfit recommendations
Enabling multitask capabilities like virtual try-on and image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified vision-language model for fashion tasks
Multimodal multitask multiround fashion assistant
Fine-tuned on 331K FashionRec dialogue samples
πŸ”Ž Similar Papers
No similar papers found.
K
Kaicheng Pang
Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China; School of Fashion and Textiles, Hong Kong Polytechnic University, Hong Kong SAR, China
Xingxing Zou
Xingxing Zou
School of Fashion and Textiles, The Hong Kong Polytechnic Univeristy
ai art
W
Waikeung Wong
Laboratory for Artificial Intelligence in Design, Hong Kong SAR, China; School of Fashion and Textiles, Hong Kong Polytechnic University, Hong Kong SAR, China