🤖 AI Summary
This paper addresses three key challenges in multimodal product relationship modeling: (1) the difficulty of inferring substitutability and complementarity, (2) high noise in user behavior data, and (3) severe long-tail sparsity. To tackle these issues, we propose the Multimodal Self-Supervised Collaboration (MMSC) framework. Methodologically, MMSC integrates multimodal foundation models with self-supervised behavioral learning, leverages LLM-generated augmented data to strengthen denoising, and introduces a hierarchical representation aggregation mechanism that deeply aligns semantic features with task objectives. The framework combines graph neural networks, multimodal representation learning, self-supervised learning, and generative data augmentation. Extensive experiments on five real-world datasets demonstrate substantial improvements: 26.1% on the main metric for substitutable recommendation and 39.2% for complementary recommendation, along with significant gains in cold-start item modeling.
📝 Abstract
We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items. Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or on leveraging item content information. However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors. In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework that addresses these challenges. Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundation model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels. Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training. We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. In addition, we empirically show that MMSC is effective in modeling cold-start items.
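The hierarchical aggregation idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimension, the additive semantic fusion, and the fixed gate weights are all illustrative assumptions standing in for the learned modules (the multimodal foundation model, the behavior-based GNN, and the learned task-level aggregation).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

# Hypothetical per-item inputs: text/image embeddings would come from a
# multimodal foundation model, and the behavior embedding from a GNN over
# user co-view/co-purchase data.
text_emb = rng.normal(size=d)
image_emb = rng.normal(size=d)
behavior_emb = rng.normal(size=d)

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-8)

# Semantic-level aggregation: fuse the modality embeddings into one
# semantic item representation (simple sum here; learned in the paper).
semantic_emb = l2_normalize(text_emb + image_emb)

# Task-level aggregation: a gate balances semantic and behavioral signals
# per relation type; the 0.7/0.3 weights below are purely illustrative.
def aggregate(semantic, behavior, gate):
    return l2_normalize(gate * semantic + (1.0 - gate) * behavior)

substitute_repr = aggregate(semantic_emb, behavior_emb, gate=0.7)
complement_repr = aggregate(semantic_emb, behavior_emb, gate=0.3)
```

The two gate settings yield distinct representations of the same item for the substitutable and complementary tasks, which is the point of aggregating at the task level rather than learning a single shared embedding.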