COHESION: Composite Graph Convolutional Network with Dual-Stage Fusion for Multimodal Recommendation

📅 2025-04-06

📈 Citations: 0

✨ Influential: 0

career value

206K/year

🤖 AI Summary

To address inaccurate representation learning caused by data sparsity and modality noise in multimodal recommendation, this paper proposes, for the first time, a synergistic framework integrating modality fusion and graph representation learning. Methodologically: (1) it introduces a two-stage strategy—ID-embedding-guided early-stage modality refinement and semantic-level late-stage fusion—to suppress irrelevant information; (2) it constructs a composite graph convolutional network to jointly model heterogeneous (user-item) and homogeneous (user-user, item-item) relations; and (3) it incorporates an adaptive cross-modal optimization mechanism to ensure balanced multimodal representation learning. Extensive experiments on three benchmark datasets demonstrate state-of-the-art performance, with up to 12.7% improvement in Recall@20 over existing methods. The results validate the effectiveness and generalizability of the proposed fusion-representation mutual enhancement paradigm.

Technology Category

Application Category

📝 Abstract

Recent works in multimodal recommendations, which leverage diverse modal information to address data sparsity and enhance recommendation accuracy, have garnered considerable interest. Two key processes in multimodal recommendations are modality fusion and representation learning. Previous approaches in modality fusion often employ simplistic attentive or pre-defined strategies at early or late stages, failing to effectively handle irrelevant information among modalities. In representation learning, prior research has constructed heterogeneous and homogeneous graph structures encapsulating user-item, user-user, and item-item relationships to better capture user interests and item profiles. Modality fusion and representation learning were considered as two independent processes in previous work. In this paper, we reveal that these two processes are complementary and can support each other. Specifically, powerful representation learning enhances modality fusion, while effective fusion improves representation quality. Stemming from these two processes, we introduce a COmposite grapH convolutional nEtwork with dual-stage fuSION for the multimodal recommendation, named COHESION. Specifically, it introduces a dual-stage fusion strategy to reduce the impact of irrelevant information, refining all modalities using ID embedding in the early stage and fusing their representations at the late stage. It also proposes a composite graph convolutional network that utilizes user-item, user-user, and item-item graphs to extract heterogeneous and homogeneous latent relationships within users and items. Besides, it introduces a novel adaptive optimization to ensure balanced and reasonable representations across modalities. Extensive experiments on three widely used datasets demonstrate the significant superiority of COHESION over various competitive baselines.

Problem

Research questions and friction points this paper is trying to address.

Improves multimodal recommendation by enhancing modality fusion and representation learning

Reduces irrelevant information impact via dual-stage fusion strategy

Utilizes composite graph network to capture user-item relationships

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stage fusion reduces irrelevant modality information

Composite graph network captures heterogeneous relationships

Adaptive optimization balances multimodal representations

🔎 Similar Papers

No similar papers found.