🤖 AI Summary
This paper addresses three key challenges in multimodal product relationship modeling: (1) the difficulty of inferring substitutability and complementarity, (2) high noise in user behavior data, and (3) severe long-tail sparsity. To tackle these issues, we propose the Multimodal Self-Supervised Collaboration (MMSC) framework. Methodologically, MMSC integrates multimodal foundation models with self-supervised behavioral learning, leverages LLM-generated augmented data to strengthen denoising, and introduces a hierarchical representation aggregation mechanism that deeply aligns semantic features with task objectives. The framework combines graph neural networks, multimodal representation learning, self-supervised learning, and generative data augmentation. Extensive experiments on five real-world datasets demonstrate substantial improvements: 26.1% on the main metric for substitutable recommendation and 39.2% for complementary recommendation, along with significant gains in cold-start item modeling.
📝 Abstract
We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items. Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or on leveraging item content information. However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors. In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework that addresses these challenges. Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundation model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels. Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training. We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. In addition, we empirically show that MMSC is effective in modeling cold-start items.
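The hierarchical aggregation idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimension, the additive semantic fusion, and the fixed gate weights are all illustrative assumptions standing in for the learned modules (the multimodal foundation model, the behavior-based GNN, and the learned task-level aggregation).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

# Hypothetical per-item inputs: text/image embeddings would come from a
# multimodal foundation model, and the behavior embedding from a GNN over
# user co-view/co-purchase data.
text_emb = rng.normal(size=d)
image_emb = rng.normal(size=d)
behavior_emb = rng.normal(size=d)

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-8)

# Semantic-level aggregation: fuse the modality embeddings into one
# semantic item representation (simple sum here; learned in the paper).
semantic_emb = l2_normalize(text_emb + image_emb)

# Task-level aggregation: a gate balances semantic and behavioral signals
# per relation type; the 0.7/0.3 weights below are purely illustrative.
def aggregate(semantic, behavior, gate):
    return l2_normalize(gate * semantic + (1.0 - gate) * behavior)

substitute_repr = aggregate(semantic_emb, behavior_emb, gate=0.7)
complement_repr = aggregate(semantic_emb, behavior_emb, gate=0.3)
```

The two gate settings yield distinct representations of the same item for the substitutable and complementary tasks, which is the point of aggregating at the task level rather than learning a single shared embedding.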