🤖 AI Summary
Existing multimodal recommendation methods predominantly rely on static multimodal knowledge graphs (MKGs), limiting effective integration of heterogeneous information from knowledge graphs, multimodal item features, and user–item interaction graphs—thereby degrading representation quality and recommendation performance. To address this, we propose a dynamic graph-enhanced multimodal recommendation framework. First, we design a self-loop iterative fusion mechanism that dynamically refines the heterogeneous graph structure by feeding back item representations learned during historical training. Second, we introduce cross-modal semantic consistency learning to jointly align multimodal features (e.g., image and text) with the evolving graph structure. This approach overcomes the limitations of static graph modeling, significantly enhancing the discriminability and robustness of user and item representations. Extensive experiments on multiple benchmark datasets demonstrate consistent improvements: +3.2% average gain in Recall@20 and +2.8% in NDCG@20 over state-of-the-art multimodal recommendation models.
📝 Abstract
Knowledge graphs (KGs) and multimodal item information, which respectively capture relational and attribute features, play a crucial role in improving recommender system accuracy. Recent studies have attempted to integrate them via multimodal knowledge graphs (MKGs) to further enhance recommendation performance. However, existing methods typically freeze the MKG structure during training, which limits the full integration of structural information from heterogeneous graphs (e.g., KG and user-item interaction graph), and results in sub-optimal performance. To address this challenge, we propose a novel framework, termed Self-loop Iterative Fusion of Heterogeneous Auxiliary Information for Multimodal Recommendation (SLIF-MR), which leverages item representations from previous training epoch as feedback signals to dynamically optimize the heterogeneous graph structures composed of KG, multimodal item feature graph, and user-item interaction graph. Through this iterative fusion mechanism, both user and item representations are refined, thus improving the final recommendation performance. Specifically, based on the feedback item representations, SLIF-MR constructs an item-item correlation graph, then integrated into the establishment process of heterogeneous graphs as additional new structural information in a self-loop manner. Consequently, the internal structures of heterogeneous graphs are updated with the feedback item representations during training. Moreover, a semantic consistency learning strategy is proposed to align heterogeneous item representations across modalities. The experimental results show that SLIF-MR significantly outperforms existing methods, particularly in terms of accuracy and robustness.