Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation

📅 2026-01-16
🏛️ ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal recommendation methods often struggle to model complex cross-modal relationships because of shallow fusion and asymmetric user-item representations. To address these issues, this work proposes CRANE, a framework whose Recursive Cross-modal Attention (RCA) mechanism iteratively captures high-order modal dependencies and constructs symmetric multimodal user profiles. CRANE further introduces a dual-graph architecture, comprising a heterogeneous user-item graph and a homogeneous item-item graph, that jointly optimizes behavioral and semantic information within a self-supervised contrastive learning framework. Extensive experiments on four real-world datasets show that CRANE consistently outperforms state-of-the-art methods, achieving an average improvement of 5% on key metrics while remaining efficient and scalable.
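As a rough illustration of how such a recursive cross-modal attention step might look, here is a minimal PyTorch sketch. The class name, shared projection layout, and number of refinement rounds are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RecursiveCrossModalAttention(nn.Module):
    """Iteratively refines two modality feature sets via cross-attention.

    Each round, every modality attends to the other in a shared latent
    space, so repeated rounds can capture higher-order dependencies.
    (Hypothetical sketch; the paper's RCA may differ in detail.)
    """

    def __init__(self, dim: int, num_rounds: int = 3):
        super().__init__()
        self.num_rounds = num_rounds
        # Shared projections into a joint latent space (an assumption).
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def cross_attend(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x attends to y: scores come from cross-correlations in the latent space.
        scores = self.q(x) @ self.k(y).transpose(-2, -1) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return self.norm(x + attn @ self.v(y))  # residual connection + norm

    def forward(self, visual: torch.Tensor, textual: torch.Tensor):
        v, t = visual, textual
        for _ in range(self.num_rounds):
            # Refine each modality against the other, then recurse.
            v, t = self.cross_attend(v, t), self.cross_attend(t, v)
        return v, t
```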

📝 Abstract
Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advancements, existing approaches still face two critical limitations: first, shallow modality fusion often relies on simple concatenation, failing to exploit rich synergistic intra- and inter-modal relationships; second, asymmetric feature treatment, where users are characterized only by interaction IDs while items benefit from rich multimodal content, hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating the features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework, comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph, unified by a self-supervised contrastive learning objective to fuse behavioral and semantic signals. Despite these complex modeling capabilities, CRANE maintains high computational efficiency. Theoretical and empirical analyses confirm its scalability and high practical efficiency, achieving faster convergence on small datasets and superior performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets validate an average 5% improvement in key metrics over state-of-the-art baselines. Our code is publicly available at https://github.com/MKC-Lab/CRANE.
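To make the symmetric-profile and contrastive components more concrete, below is a small, hypothetical PyTorch sketch: mean-pooling interacted items' multimodal features into user profiles, and an InfoNCE-style objective aligning behavioral and semantic views of the same nodes. The function names, mean-pooling choice, and temperature are assumptions; the paper may use a different aggregation and loss weighting.

```python
import torch
import torch.nn.functional as F

def build_user_profiles(interactions: torch.Tensor,
                        item_feats: torch.Tensor) -> torch.Tensor:
    """Mean-pool the multimodal features of each user's interacted items.

    interactions: binary user-item matrix of shape (num_users, num_items).
    item_feats:   fused multimodal item features of shape (num_items, dim).
    """
    counts = interactions.sum(dim=1, keepdim=True).clamp(min=1)
    return (interactions @ item_feats) / counts

def info_nce(behavioral: torch.Tensor, semantic: torch.Tensor,
             tau: float = 0.2) -> torch.Tensor:
    """InfoNCE loss aligning behavioral (user-item graph) and semantic
    (item-item graph) embeddings of the same node; other nodes in the
    batch serve as negatives."""
    b = F.normalize(behavioral, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = b @ s.t() / tau                       # pairwise similarities
    targets = torch.arange(b.size(0), device=b.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors.
inter = (torch.rand(8, 20) > 0.7).float()
items = torch.randn(20, 64)
users = build_user_profiles(inter, items)          # (8, 64) multimodal profiles
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```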
Problem

Research questions and friction points this paper is trying to address.

multimodal recommendation
shallow modality fusion
asymmetric feature representation
cross-modal relationships
shared semantic space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Attention
Recursive Feature Refinement
Symmetric Multimodal Learning
Dual Graph Learning (see the sketch after this list)
Contrastive Self-supervision
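As a simplified illustration of the dual-graph component listed above, the following sketch builds a kNN item-item semantic graph from multimodal features and runs parameter-free, LightGCN-style propagation over the user-item interaction graph. All function names, the degree normalization, and the neighbor count are illustrative assumptions rather than the paper's construction.

```python
import torch
import torch.nn.functional as F

def knn_item_graph(item_feats: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Homogeneous item-item graph: keep the top-k cosine neighbors per
    item (self-loops included), then row-normalize."""
    normed = F.normalize(item_feats, dim=-1)
    sim = normed @ normed.t()
    topk = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    return adj / adj.sum(dim=-1, keepdim=True)

def propagate(user_emb, item_emb, inter, item_adj, layers: int = 2):
    """Parameter-free message passing over both graphs (LightGCN flavor):
    users gather from interacted items, items gather from interacting
    users plus their semantic neighbors."""
    u_deg = inter.sum(1, keepdim=True).clamp(min=1)
    i_deg = inter.sum(0, keepdim=True).t().clamp(min=1)
    u, i = user_emb, item_emb
    for _ in range(layers):
        u_next = (inter @ i) / u_deg               # heterogeneous graph hop
        i_next = (inter.t() @ u) / i_deg
        i_next = i_next + item_adj @ i             # homogeneous semantic hop
        u, i = u_next, i_next
    return u, i
```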
👥 Authors
Ji Dai
Beijing University of Posts and Telecommunications, China
Quan Fang
Ph.D., Institute of Automation, Chinese Academy of Sciences (CASIA)
Knowledge Graph · Data Mining · Multimedia · Social Media
Jun Hu
School of Computing, National University of Singapore (NUS)
Desheng Cai
Tianjin University of Technology, China
Yang Yang
Beihang University and State Key Laboratory of CNS/ATM, China
Can Zhao
Nvidia
Medical image analysis