🤖 AI Summary
To address the challenge of efficiently adapting multimodal foundation models (MFMs) to sequential recommendation—particularly when integrating more than two raw modalities (e.g., text and images)—this paper proposes CROSSAN, a plug-and-play cross-modal side adapter network. Methodologically, CROSSAN introduces: (1) a fully decoupled side adapter paradigm that keeps the backbone encoders frozen and avoids full-parameter fine-tuning; (2) a Mixture of Modality Expert Fusion (MOMEF) mechanism enabling dynamic gated fusion across modalities; and (3) a lightweight training scheme whose computational cost is far below full fine-tuning and even below conventional parameter-efficient fine-tuning (PEFT) techniques such as Adapter and LoRA. Empirically, CROSSAN consistently improves recommendation performance when adapting four distinct MFMs on public benchmarks, outperforming state-of-the-art baselines, with further gains as more MFMs are adapted. The implementation code and datasets will be made publicly available.
📝 Abstract
Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities (e.g., text, images, audio, and video). As recommender systems increasingly incorporate these modalities, leveraging MFMs to generate better representations has great potential. However, their application in sequential recommendation remains largely unexplored. This is primarily because mainstream adaptation methods, such as fine-tuning and even Parameter-Efficient Fine-Tuning (PEFT) techniques (e.g., Adapter and LoRA), incur high computational costs, especially when integrating multiple modality encoders, thus hindering research progress. As a result, it remains unclear whether multiple (>2) MFMs can be adapted efficiently and effectively for the sequential recommendation task. To address this, we propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN). Leveraging a fully decoupled, side adapter-based paradigm, CROSSAN achieves high efficiency while enabling cross-modal learning across diverse modalities. To optimize the final stage of multimodal fusion, we adopt the Mixture of Modality Expert Fusion (MOMEF) mechanism. CROSSAN achieves superior performance on public datasets when adapting four foundation models with raw-modality inputs, and performance consistently improves as more MFMs are adapted. We will release our code and datasets to facilitate future research.
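The abstract does not spell out how MOMEF combines modality streams, but a gated mixture-of-modality-experts fusion of the kind it names can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shapes, the per-modality expert projections, and the gating network are all assumptions, and real weights would be learned rather than random.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, d, M = 2, 8, 3  # M modalities (e.g., text, image, audio); hypothetical sizes

# Placeholder outputs of M frozen modality encoders: one d-dim vector each.
modality_feats = rng.normal(size=(batch, M, d))

# Each modality "expert" is a small projection (random stand-in for learned weights).
expert_W = rng.normal(size=(M, d, d)) * 0.1
expert_out = np.einsum('bmd,mde->bme', modality_feats, expert_W)  # (batch, M, d)

# A gating network scores each modality per item; softmax makes weights sum to 1.
gate_W = rng.normal(size=(M * d, M)) * 0.1
gate = softmax(modality_feats.reshape(batch, -1) @ gate_W)        # (batch, M)

# Fused item representation: gate-weighted sum of expert outputs.
fused = np.einsum('bm,bme->be', gate, expert_out)
print(fused.shape)  # (2, 8)
```

The key property illustrated is that the gate is input-dependent, so each item can weight its modalities differently, while the frozen encoders contribute only fixed features.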