π€ AI Summary
Addressing the dual challenges of modeling usersβ aesthetic preferences and mitigating the cold-start problem for new fashion items, this paper proposes an aesthetics-driven dual-attribute graph modeling framework. It jointly leverages fine-grained visual attributes (extracted from images) and semantic textual features to construct ID-agnostic, denoised user representations with strong cross-item generalization capability. Methodologically, we introduce the first prompt-guided, multimodal large model (LLM + VLM)-based approach for fine-grained attribute extraction; design a graph neural network to capture high-order interactions among users, attributes, and items; and incorporate a noise-robust mechanism for user interest graph construction. Evaluated on the IQON3000 dataset, our method significantly outperforms ID-based baselines, achieving substantial gains in recommendation accuracy under cold-start conditions. Results validate the effectiveness and generalizability of explicit aesthetic perception modeling coupled with deep multimodal fusion.
π Abstract
Personalized fashion recommendation is a difficult task because 1) the decisions are highly correlated with users' aesthetic appetite, which previous work frequently overlooks, and 2) many new items are constantly rolling out that cause strict cold-start problems in the popular identity (ID)-based recommendation methods. These new items are critical to recommend because of trend-driven consumerism. In this work, we aim to provide more accurate personalized fashion recommendations and solve the cold-start problem by converting available information, especially images, into two attribute graphs focusing on optimized image utilization and noise-reducing user modeling. Compared with previous methods that separate image and text as two components, the proposed method combines image and text information to create a richer attributes graph. Capitalizing on the advancement of large language and vision models, we experiment with extracting fine-grained attributes efficiently and as desired using two different prompts. Preliminary experiments on the IQON3000 dataset have shown that the proposed method achieves competitive accuracy compared with baselines.