Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation

📅 2024-11-05
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal foundation models (MFMs) for sequential recommendation are typically adapted with parameter-efficient fine-tuning (PEFT), yet most work overlooks GPU memory constraints and training-speed bottlenecks; prior frameworks such as IISAN support only symmetric architectures with identical text and image encoders, limiting compatibility with advanced large language models (LLMs). Method: IISAN-Versa is a general-purpose, plug-and-play adaptation framework supporting both symmetric and asymmetric MFMs. It introduces a decoupled PEFT structure combining intra- and inter-modal adaptation, together with group layer-dropping and dimension transformation alignment to integrate LLM-scale text encoders efficiently. Contribution/Results: Experiments show a positive scaling effect, with larger text encoders generally performing better. On the Microlens public benchmark, IISAN-Versa achieves state-of-the-art performance and handles diverse multimodal scenarios, including raw titles and captions generated from images and videos. Code and datasets are publicly released.

📝 Abstract
Multimodal foundation models (MFMs) have revolutionized sequential recommender systems through advanced representation learning. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt these models, studies often prioritize parameter efficiency, neglecting GPU memory and training speed. To address this, we introduced the IISAN framework, significantly enhancing efficiency. However, IISAN was limited to symmetrical MFMs and identical text and image encoders, preventing the use of state-of-the-art Large Language Models. To overcome this, we developed IISAN-Versa, a versatile plug-and-play architecture compatible with both symmetrical and asymmetrical MFMs. IISAN-Versa employs a Decoupled PEFT structure and utilizes both intra- and inter-modal adaptation. It effectively handles asymmetry through a simple yet effective combination of group layer-dropping and dimension transformation alignment. Our research demonstrates that IISAN-Versa effectively adapts large text encoders, and we further identify a scaling effect where larger encoders generally perform better. IISAN-Versa also demonstrates strong versatility in our defined multimodal scenarios, which include raw titles and captions generated from images and videos. Additionally, IISAN-Versa achieved state-of-the-art performance on the Microlens public benchmark. We will release our code and datasets to support future research.
Problem

Research questions and friction points this paper is trying to address.

Adapting multimodal foundation models efficiently for sequential recommendation
Overcoming limitations of symmetrical models and identical encoders
Enhancing compatibility with both symmetrical and asymmetrical MFMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled PEFT structure for multimodal adaptation
Group layer-dropping with dimension transformation alignment
Versatile plug-and-play architecture for symmetrical/asymmetrical MFMs
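The asymmetry-handling pair above (group layer-dropping plus dimension transformation alignment) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the even-spacing heuristic for choosing which layers to keep, and the toy encoder sizes are all assumptions made for the example.

```python
import numpy as np

def group_layer_drop(hidden_states, target_layers):
    """Keep an evenly spaced subset of a deep text encoder's layer
    outputs so it can be paired layer-by-layer with a shallower image
    encoder. The even-spacing rule here is an illustrative choice."""
    n = len(hidden_states)
    idx = np.linspace(0, n - 1, target_layers).round().astype(int)
    return [hidden_states[i] for i in idx]

def dimension_align(h, proj):
    """Map a text-encoder hidden state (width d_text) into the image
    encoder's width (d_img) via a linear projection; in a real adapter
    this matrix would be learned, here it is a fixed toy matrix."""
    return h @ proj

# Toy asymmetric pair: a 24-layer text encoder (width 1024) aligned
# to a 12-layer image encoder (width 768).
rng = np.random.default_rng(0)
text_layers = [rng.standard_normal(1024) for _ in range(24)]
proj = rng.standard_normal((1024, 768)) * 0.02

kept = group_layer_drop(text_layers, target_layers=12)
aligned = [dimension_align(h, proj) for h in kept]
print(len(aligned), aligned[0].shape)  # 12 layers, each of width 768
```

After these two steps, each retained text layer has a counterpart image layer of matching depth and width, which is the precondition for the kind of layer-wise intra- and inter-modal adaptation the framework describes.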