🤖 AI Summary
This work addresses the limitations of existing multimodal sequential recommendation methods, which suffer from constrained semantic representation due to small frozen encoders and from ineffective fusion of collaborative filtering signals. Directly fine-tuning vision-language models (VLMs) often leads to modality collapse, where a single modality dominates optimization while the other degrades. To overcome these challenges, the authors propose VLM2Rec, an embedding architecture built upon a high-capacity VLM, complemented by weak-modality-penalized contrastive learning to rectify gradient imbalance and a cross-modal relational topology regularizer to preserve geometric consistency across modalities. This approach achieves, for the first time, balanced multimodal fusion that remains aware of collaborative filtering signals. Extensive experiments demonstrate consistent and significant improvements over state-of-the-art methods across diverse scenarios, with notable gains in both recommendation accuracy and robustness.
📝 Abstract
Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify the VLM's inherent modality collapse: optimization becomes dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.
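The abstract does not give the loss formulations, but the two named components admit a natural reading: re-weight per-modality contrastive losses so the weaker (higher-loss) modality receives a larger share of the gradient, and penalize divergence between the modalities' pairwise-similarity graphs. The sketch below illustrates that reading with NumPy; the function names, the softmax-over-losses weighting, and the cosine-similarity topology penalty are all illustrative assumptions, not the paper's actual definitions.

```python
import numpy as np

def infonce_loss(anchors, positives, tau=0.07):
    """In-batch InfoNCE: row i of `positives` is the positive for anchor i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def weak_modality_penalized_loss(txt_emb, img_emb, cf_emb, tau=0.07):
    """Hypothetical weak-modality penalty: softmax over per-modality losses,
    so the weaker modality (larger loss) gets the larger weight and the
    gradient imbalance between modalities is counteracted."""
    l_txt = infonce_loss(txt_emb, cf_emb, tau)
    l_img = infonce_loss(img_emb, cf_emb, tau)
    w = np.exp([l_txt, l_img])
    w = w / w.sum()
    return w[0] * l_txt + w[1] * l_img

def relational_topology_reg(txt_emb, img_emb):
    """Hypothetical topology regularizer: align the two modalities'
    cosine-similarity graphs over the batch to preserve geometry."""
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    return np.mean((t @ t.T - v @ v.T) ** 2)
```

Under this reading, the total fine-tuning objective would combine the penalized contrastive term with the topology term, e.g. `loss = weak_modality_penalized_loss(t, v, cf) + lam * relational_topology_reg(t, v)` for some trade-off weight `lam` (again an assumption; the paper's weighting scheme may differ).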