AI Summary
This work addresses key challenges in multimodal sequential recommendation, including data sparsity, insufficient cross-modal semantic alignment, and the loss of fine-grained semantics from raw textual inputs. To tackle these issues, the authors propose DMESR, a dual-view enhancement framework that integrates two complementary strategies: first, it employs contrastive learning to align cross-modal semantics generated by a multimodal large language model (MLLM); second, it introduces a cross-attention fusion module to jointly model the coarse-grained semantics from the MLLM and the fine-grained information preserved in the original text. Extensive experiments on three real-world datasets demonstrate that DMESR significantly boosts the performance of three mainstream sequential recommendation models, thereby validating its effectiveness and strong generalization capability.
Abstract
Sequential Recommender Systems (SRS) aim to predict users' next interaction based on their historical behaviors, but continue to face the challenge of data sparsity. With the rapid advancement of Multimodal Large Language Models (MLLMs), leveraging their multimodal understanding capabilities to enrich item semantic representations has emerged as an effective enhancement strategy for SRS. However, existing MLLM-enhanced recommendation methods still suffer from two key limitations. First, they struggle to effectively align multimodal representations, leading to suboptimal utilization of semantic information across modalities. Second, they often rely heavily on MLLM-generated content while overlooking the fine-grained semantic cues contained in the original textual data of items. To address these issues, we propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained semantics of the original text. Finally, these two fused representations can be seamlessly integrated into downstream sequential recommendation models. Extensive experiments conducted on three real-world datasets and three popular sequential recommendation architectures demonstrate the superior effectiveness and generalizability of our proposed approach.
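The two components described above — contrastive alignment of MLLM-generated cross-modal representations and cross-attention fusion of coarse- and fine-grained semantics — can be illustrated with a minimal PyTorch sketch. This is a generic rendering of the two techniques named in the abstract, not the paper's actual implementation; all names, dimensions, and the symmetric InfoNCE formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, image_emb, temperature=0.07):
    # Symmetric InfoNCE loss (a common choice for cross-modal alignment;
    # the paper's exact contrastive objective may differ).
    # text_emb, image_emb: (B, D) per-item embeddings from the MLLM views.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature  # (B, B) similarity matrix
    labels = torch.arange(text_emb.size(0))        # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

class CrossAttentionFusion(torch.nn.Module):
    """Fuse coarse-grained MLLM semantics with fine-grained original-text
    features via cross-attention (illustrative module, hypothetical names)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mllm_tokens, text_tokens):
        # Queries come from the MLLM summary; keys/values from the raw text,
        # so fine-grained textual cues are injected into the coarse view.
        fused, _ = self.attn(mllm_tokens, text_tokens, text_tokens)
        return fused + mllm_tokens  # residual connection

# Usage sketch: align two modality views, then fuse with original text.
torch.manual_seed(0)
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
fusion = CrossAttentionFusion(dim=64)
out = fusion(torch.randn(2, 5, 64), torch.randn(2, 7, 64))  # (2, 5, 64)
```

The fused item representations would then be fed into a downstream sequential recommender in place of (or alongside) its standard item embeddings.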