Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation

📅 2026-01-16
🏛️ ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal recommendation methods often struggle to model complex cross-modal relationships because of shallow fusion and asymmetric user-item representations. To address these issues, this work proposes CRANE, a framework whose Recursive Cross-modal Attention (RCA) mechanism iteratively captures high-order modal dependencies and constructs symmetric multimodal user profiles. CRANE further introduces a dual-graph architecture, comprising a heterogeneous user-item graph and a homogeneous item-item graph, that jointly optimizes behavioral and semantic information within a self-supervised contrastive learning framework. Extensive experiments on four real-world datasets show that CRANE consistently outperforms state-of-the-art methods, achieving an average improvement of 5% on key metrics while remaining efficient and scalable.
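As a rough illustration of how such a recursive cross-modal attention step might look, here is a minimal PyTorch sketch. The class name, shared projection layout, and number of refinement rounds are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RecursiveCrossModalAttention(nn.Module):
    """Iteratively refines two modality feature sets via cross-attention.

    Each round, every modality attends to the other in a shared latent
    space, so repeated rounds can capture higher-order dependencies.
    (Hypothetical sketch; the paper's RCA may differ in detail.)
    """

    def __init__(self, dim: int, num_rounds: int = 3):
        super().__init__()
        self.num_rounds = num_rounds
        # Shared projections into a joint latent space (an assumption).
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def cross_attend(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x attends to y: scores come from cross-correlations in the latent space.
        scores = self.q(x) @ self.k(y).transpose(-2, -1) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return self.norm(x + attn @ self.v(y))  # residual connection + norm

    def forward(self, visual: torch.Tensor, textual: torch.Tensor):
        v, t = visual, textual
        for _ in range(self.num_rounds):
            # Refine each modality against the other, then recurse.
            v, t = self.cross_attend(v, t), self.cross_attend(t, v)
        return v, t
```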

📝 Abstract
Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advancements, existing approaches still face two critical limitations: first, shallow modality fusion often relies on simple concatenation, failing to exploit rich synergistic intra- and inter-modal relationships; second, asymmetric feature treatment, where users are characterized only by interaction IDs while items benefit from rich multimodal content, hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating the features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework, comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph, unified by a self-supervised contrastive learning objective to fuse behavioral and semantic signals. Despite these complex modeling capabilities, CRANE maintains high computational efficiency. Theoretical and empirical analyses confirm its scalability and high practical efficiency, achieving faster convergence on small datasets and superior performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets validate an average 5% improvement in key metrics over state-of-the-art baselines. Our code is publicly available at https://github.com/MKC-Lab/CRANE.
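To make the symmetric-profile and contrastive components more concrete, below is a small, hypothetical PyTorch sketch: mean-pooling interacted items' multimodal features into user profiles, and an InfoNCE-style objective aligning behavioral and semantic views of the same nodes. The function names, mean-pooling choice, and temperature are assumptions; the paper may use a different aggregation and loss weighting.

```python
import torch
import torch.nn.functional as F

def build_user_profiles(interactions: torch.Tensor,
                        item_feats: torch.Tensor) -> torch.Tensor:
    """Mean-pool the multimodal features of each user's interacted items.

    interactions: binary user-item matrix of shape (num_users, num_items).
    item_feats:   fused multimodal item features of shape (num_items, dim).
    """
    counts = interactions.sum(dim=1, keepdim=True).clamp(min=1)
    return (interactions @ item_feats) / counts

def info_nce(behavioral: torch.Tensor, semantic: torch.Tensor,
             tau: float = 0.2) -> torch.Tensor:
    """InfoNCE loss aligning behavioral (user-item graph) and semantic
    (item-item graph) embeddings of the same node; other nodes in the
    batch serve as negatives."""
    b = F.normalize(behavioral, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = b @ s.t() / tau                       # pairwise similarities
    targets = torch.arange(b.size(0), device=b.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors.
inter = (torch.rand(8, 20) > 0.7).float()
items = torch.randn(20, 64)
users = build_user_profiles(inter, items)          # (8, 64) multimodal profiles
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```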
Problem

Research questions and friction points this paper is trying to address.

multimodal recommendation
shallow modality fusion
asymmetric feature representation
cross-modal relationships
shared semantic space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Attention
Recursive Feature Refinement
Symmetric Multimodal Learning
Dual Graph Learning (see the sketch after this list)
Contrastive Self-supervision
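As a simplified illustration of the dual-graph component listed above, the following sketch builds a kNN item-item semantic graph from multimodal features and runs parameter-free, LightGCN-style propagation over the user-item interaction graph. All function names, the degree normalization, and the neighbor count are illustrative assumptions rather than the paper's construction.

```python
import torch
import torch.nn.functional as F

def knn_item_graph(item_feats: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Homogeneous item-item graph: keep the top-k cosine neighbors per
    item (self-loops included), then row-normalize."""
    normed = F.normalize(item_feats, dim=-1)
    sim = normed @ normed.t()
    topk = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    return adj / adj.sum(dim=-1, keepdim=True)

def propagate(user_emb, item_emb, inter, item_adj, layers: int = 2):
    """Parameter-free message passing over both graphs (LightGCN flavor):
    users gather from interacted items, items gather from interacting
    users plus their semantic neighbors."""
    u_deg = inter.sum(1, keepdim=True).clamp(min=1)
    i_deg = inter.sum(0, keepdim=True).t().clamp(min=1)
    u, i = user_emb, item_emb
    for _ in range(layers):
        u_next = (inter @ i) / u_deg               # heterogeneous graph hop
        i_next = (inter.t() @ u) / i_deg
        i_next = i_next + item_adj @ i             # homogeneous semantic hop
        u, i = u_next, i_next
    return u, i
```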
👥 Authors
Ji Dai
Beijing University of Posts and Telecommunications, China
Quan Fang
Ph.D., Institute of Automation, Chinese Academy of Sciences (CASIA)
Knowledge Graph · Data Mining · Multimedia · Social Media
Jun Hu
School of Computing, National University of Singapore (NUS)
Desheng Cai
Tianjin University of Technology, China
Yang Yang
Beihang University and State Key Laboratory of CNS/ATM, China
Can Zhao
Nvidia
Medical image analysis