🤖 AI Summary
Existing fashion datasets are often fragmented and limited to single tasks, which hinders expert-level, holistic understanding of style, occasion, and outfit coordination logic. This work introduces a multimodal benchmark dataset annotated by fashion experts, with fine-grained semantic labels for both individual garments and complete outfits. It further proposes the first expert knowledge–driven unified framework for fashion understanding, covering three core tasks: outfit-to-item grounding, outfit completion, and semantic evaluation. By combining expert annotation protocols, multimodal large language model training, and context-aware compatibility modeling, the approach improves performance on grounding, completion, and outfit-level semantic evaluation, showing that the dataset is effective both as a unified benchmark and as a training resource.
📝 Abstract
Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.