🤖 AI Summary
To address cold-start and data sparsity challenges in collaborative filtering, this paper proposes a lightweight, scalable multimodal recommendation framework. It has two components: (1) a gated multimodal fusion module that dynamically weights image and text features, adapting to differences in quality between the modalities; (2) a two-layer LightGCN encoder, with no nonlinear transformations, that propagates embeddings over the user-item interaction graph to capture high-order collaborative signals efficiently. The framework balances performance, efficiency, and interpretability: it outperforms competitive collaborative filtering, visual-aware, and multimodal GNN baselines on an Amazon product dataset, with substantial gains in Top-K metrics (e.g., Recall@20, NDCG@20). It also reduces parameter count and computational overhead by orders of magnitude, enabling practical large-scale deployment.
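The gated fusion idea can be illustrated with a minimal sketch. The exact gate used by the paper is not given here, so the form below is an assumption: an element-wise sigmoid gate computed from the concatenated visual and textual vectors, which are assumed to be already projected to the same dimension. The function name `gated_fusion` and its parameters are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, textual, gate_weights, gate_bias):
    """Fuse two same-length feature vectors with an element-wise gate.

    Assumed form (illustrative, not taken from the paper):
        g     = sigmoid(W_g @ [visual; textual] + b_g)
        fused = g * visual + (1 - g) * textual
    gate_weights has one row per output dimension, each of length 2 * d.
    """
    d = len(visual)
    if len(textual) != d or len(gate_weights) != d or len(gate_bias) != d:
        raise ValueError("visual, textual, gate_weights, gate_bias must agree on d")
    concat = list(visual) + list(textual)
    gates, fused = [], []
    for row, bias, v, t in zip(gate_weights, gate_bias, visual, textual):
        g = sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
        gates.append(g)
        # g near 1 trusts the visual feature, g near 0 trusts the textual one.
        fused.append(g * v + (1.0 - g) * t)
    return fused, gates


# With zero weights and bias the gate is exactly 0.5 in every dimension,
# so the fused vector is the plain average of the two modalities.
fused, gates = gated_fusion([1.0, 0.0], [0.0, 1.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
```

Because the gate values are explicit per item and per dimension, they can be inspected to see which modality dominated a given item representation, which is one plausible reading of the interpretability claim.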
📝 Abstract
Multimodal recommendation has emerged as a promising solution to alleviate the cold-start and sparsity problems in collaborative filtering by incorporating rich content information, such as product images and textual descriptions. However, effectively integrating heterogeneous modalities into a unified recommendation framework remains a challenge. Existing approaches often rely on fixed fusion strategies or complex architectures, which may fail to adapt to variance in modality quality or may introduce unnecessary computational overhead. In this work, we propose RLMultimodalRec, a lightweight and modular recommendation framework that combines graph-based user modeling with adaptive multimodal item encoding. The model employs a gated fusion module to dynamically balance the contributions of the visual and textual modalities, enabling fine-grained, content-aware item representations. Meanwhile, a two-layer LightGCN encoder captures high-order collaborative signals by propagating embeddings over the user-item interaction graph without relying on nonlinear transformations. We evaluate our model on a real-world dataset from the Amazon product domain. Experimental results demonstrate that RLMultimodalRec consistently outperforms several competitive baselines, including collaborative filtering, visual-aware, and multimodal GNN-based methods. The proposed approach achieves significant improvements in top-K recommendation metrics while maintaining scalability and interpretability, making it suitable for practical deployment.
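The propagation step described above can be sketched as follows. This follows the standard LightGCN rule (symmetric degree normalization, no weight matrices, no activation) and averages the layer outputs with uniform weights, as in the original LightGCN formulation; whether RLMultimodalRec combines layers in exactly this way is an assumption. The function name `lightgcn_propagate` and the toy inputs are hypothetical.

```python
import math


def _aggregate(nbrs, nbrs_of_other, other_emb, dim):
    """One linear propagation step: a degree-normalized sum over neighbours."""
    out = []
    for neighbours in nbrs:
        acc = [0.0] * dim  # isolated nodes keep a zero vector at this layer
        for j in neighbours:
            norm = 1.0 / math.sqrt(len(neighbours) * len(nbrs_of_other[j]))
            for k in range(dim):
                acc[k] += norm * other_emb[j][k]
        out.append(acc)
    return out


def lightgcn_propagate(user_emb, item_emb, interactions, num_layers=2):
    """Return layer-averaged user and item embeddings.

    interactions: iterable of (user_index, item_index) pairs, each pair
    assumed to appear once.
    """
    dim = len(user_emb[0])
    user_nbrs = [[] for _ in user_emb]
    item_nbrs = [[] for _ in item_emb]
    for u, i in interactions:
        user_nbrs[u].append(i)
        item_nbrs[i].append(u)

    users = [list(e) for e in user_emb]
    items = [list(e) for e in item_emb]
    user_sum = [list(e) for e in users]  # running sum over layers 0..K
    item_sum = [list(e) for e in items]
    for _ in range(num_layers):
        # Both updates read the previous layer, so compute before assigning.
        new_users = _aggregate(user_nbrs, item_nbrs, items, dim)
        new_items = _aggregate(item_nbrs, user_nbrs, users, dim)
        users, items = new_users, new_items
        for total, layer in ((user_sum, users), (item_sum, items)):
            for row_total, row in zip(total, layer):
                for k in range(dim):
                    row_total[k] += row[k]

    scale = 1.0 / (num_layers + 1)
    return ([[x * scale for x in r] for r in user_sum],
            [[x * scale for x in r] for r in item_sum])


# Toy graph: one user connected to one item.
final_users, final_items = lightgcn_propagate([[1.0, 0.0]], [[0.0, 1.0]], [(0, 0)])
```

In the framework described here, the item embeddings fed into the graph would presumably come from, or be combined with, the gated multimodal item representations; the abstract does not spell out that wiring, so the sketch treats the initial embeddings as given.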