🤖 AI Summary
Existing generative fashion recommendation methods rely on implicit visual embeddings from historical interactions, which often carry preference-irrelevant information and model user behavior poorly. Because they generate only item images, they also offer limited interpretability. This work proposes DualFashion, an architecture that jointly generates item images and their textual descriptions for personalized fashion recommendation. A dual-diffusion Transformer with image and text branches produces both outputs, conditioned on structured attribute-level captions and visual outfit information. A text-augmented fine-tuning strategy improves generation diversity and supports cross-modal knowledge transfer without heavy computational cost. On the iFashion and Polyvore-U datasets, across the Personalized Fill-in-the-Blank and Generative Outfit Recommendation tasks, DualFashion is reported to achieve strong performance in behavior modeling, interpretability, and efficiency compared to state-of-the-art methods.
📝 Abstract
Personalized generative recommender systems have emerged as a promising solution for fashion recommendation. However, existing methods primarily rely on implicit visual embeddings from historical interactions, which often contain preference-irrelevant information and result in insufficient user behavior modeling. Moreover, these models typically generate only item images, providing limited interpretability. To address these limitations, we propose DualFashion, a Dual-Diffusional Generative Fashion Recommendation Architecture that jointly models image and text modalities for personalized and explainable recommendation. DualFashion adopts a dual-diffusion Transformer with image and text branches, where structured attribute-level captions and visual outfit information are jointly used as conditioning signals to model user behavior. The proposed architecture produces both fashion item images and textual descriptions, ensuring visual compatibility while providing explicit semantic interpretability. Furthermore, we introduce a text-augmented fine-tuning strategy that enhances generation diversity and enables effective cross-modal knowledge transfer without incurring heavy computational costs. Extensive experiments on iFashion and Polyvore-U across Personalized Fill-in-the-Blank and Generative Outfit Recommendation tasks demonstrate that DualFashion achieves strong performance in behavior modeling, interpretability, and efficiency compared to state-of-the-art methods. Our code and model checkpoints are available at https://github.com/LinkMingzhe/DualFashion.