OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing fashion intelligence systems struggle to establish a unified visual-semantic structure due to fragmented tasks and incomplete annotations, which limits their generalization and reasoning capabilities. To address this, the work introduces FashionX, a million-scale dataset with fine-grained annotations, and proposes OmniFashion, a unified vision-language framework that, for the first time, integrates diverse tasks (retrieval, recommendation, recognition, and dialogue) into a consistent conversational paradigm, enabling cross-task collaborative reasoning. Leveraging a hierarchical attribute annotation scheme spanning global to local levels, large-scale vision-language pretraining, and multi-task learning, OmniFashion achieves state-of-the-art performance across multiple benchmarks, significantly improving task accuracy and cross-task generalization and thereby advancing general-purpose fashion intelligence.

📝 Abstract
Fashion intelligence spans multiple tasks, including retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. We therefore construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from the global to the part level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.
Problem

Research questions and friction points this paper is trying to address.

fashion intelligence
vision-language models
multi-task learning
incomplete annotations
fragmented supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language learning
multi-task learning
fashion intelligence
dialogue-based reasoning
fine-grained annotation
Zhengwei Yang
Wuhan University | A*STAR
Computer Vision, Causal Inference, Re-identification
Andi Long
National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University
Hao Li
Wuhan University
Computer Vision, Visual Reasoning
Zechao Hu
National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University
Kui Jiang
Harbin Institute of Technology
Computer Vision, Image Processing, Deep Learning
Zheng Wang
Wuhan University
Multimedia Content Analysis, Computer Vision, Artificial Intelligence