LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation

📅 2025-06-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing cross-domain sequential recommendation (CDSR) methods suffer from three key limitations: inaccurate user behavior prediction, weak modeling of cross-domain preference transfer, and insufficient coupling of intra- and inter-sequence item relationships. To address these challenges, this paper proposes a multimodal-enhanced CDSR framework. It freezes the CLIP model to extract joint image-text embeddings, leverages a large language model (LLM) to enhance textual semantic representation, and designs a multi-attention mechanism to jointly model domain-specific sequential dynamics and cross-domain preference transfer. Crucially, LLM-derived knowledge is explicitly injected into the multimodal fusion architecture to enable cross-domain semantic alignment and joint modeling of sequential relationships. Extensive experiments on four e-commerce datasets demonstrate that the method consistently outperforms state-of-the-art baselines, validating both the effectiveness and generalizability of multimodal collaborative modeling for CDSR.

πŸ“ Abstract
Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, focusing on modeling cross-domain preferences and capturing both intra- and inter-sequence item relationships. We propose LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation (LLM-EMF), a novel approach that enhances textual information with Large Language Model (LLM) knowledge and significantly improves recommendation performance through the fusion of visual and textual data. Using the frozen CLIP model, we generate image and text embeddings, thereby enriching item representations with multimodal data. A multiple attention mechanism jointly learns both single-domain and cross-domain preferences, effectively capturing complex user interests across diverse domains. Evaluations conducted on four e-commerce datasets demonstrate that LLM-EMF consistently outperforms existing methods in modeling cross-domain user preferences, highlighting the effectiveness of multimodal data integration and its advantages for sequential recommendation systems. Our source code will be released.
Problem

Research questions and friction points this paper is trying to address.

Enhances cross-domain recommendation with multimodal data fusion
Improves item representation using LLM knowledge and visual-textual embeddings
Captures complex user preferences across domains via multiple attention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced multimodal fusion for recommendations
Frozen CLIP model for embedding generation
Multiple attention mechanism for cross-domain learning
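The pipeline described above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch, not the paper's released implementation: random vectors stand in for frozen CLIP image embeddings and LLM-enhanced text embeddings, a simple additive operator stands in for multimodal fusion (the paper's exact fusion operator is not specified in this summary), and a single scaled dot-product attention pass stands in for the multiple attention mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_item_embeddings(img_emb, txt_emb):
    # Additive fusion of image and text embeddings; a stand-in for
    # whatever fusion operator LLM-EMF actually uses.
    return img_emb + txt_emb

def self_attention(seq, W_q, W_k, W_v):
    # Single-head scaled dot-product attention over an item sequence.
    Q, K, V = seq @ W_q, seq @ W_k, seq @ W_v
    d = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d))
    return scores @ V

rng = np.random.default_rng(0)
d = 16        # toy embedding dim (CLIP projections are larger in practice)
seq_len = 5   # length of a user's interleaved cross-domain history

img = rng.normal(size=(seq_len, d))  # stand-in: frozen CLIP image embeddings
txt = rng.normal(size=(seq_len, d))  # stand-in: LLM-enhanced text embeddings
items = fuse_item_embeddings(img, txt)

# In the paper, separate attention modules learn single-domain and
# cross-domain preferences; one representative pass is shown here.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
user_repr = self_attention(items, W_q, W_k, W_v)
print(user_repr.shape)  # one contextualized vector per interacted item
```

The contextualized sequence representation would then feed a prediction head scoring candidate items in each target domain.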