CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches struggle to effectively model users' dynamic preferences and cross-modal synergies in multi-category, multi-context scenarios involving heterogeneous modalities such as text and images. To address this challenge, this work proposes the CAMMSR model, which introduces a novel category-guided Mixture-of-Experts (MoE) attention mechanism to enable dynamic and explicit multimodal fusion. The model further enhances cross-modal alignment through modality-swapping contrastive learning and leverages a category prediction auxiliary task to guide adaptive modality weighting, thereby improving personalization and contextual adaptability. Extensive experiments on four public datasets demonstrate that CAMMSR significantly outperforms state-of-the-art methods, validating its effectiveness in adaptive fusion, cross-modal collaboration, and user-centric recommendation.
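The summary does not specify how the category-guided MoE attention is parameterized. A minimal sketch of the general idea — a gate conditioned on a context vector (e.g. a category embedding) allocates weights over modality experts, whose outputs are then fused — follows; all names, shapes, and values are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(context_vec, gate_rows):
    """Score each modality expert with a dot product against a learned
    gate row, then normalize; `gate_rows` stands in for the
    category-guided gating parameters (hypothetical)."""
    scores = [sum(g * x for g, x in zip(row, context_vec)) for row in gate_rows]
    return softmax(scores)

def fuse(expert_outputs, weights):
    """Fused item representation: weighted sum of modality expert outputs."""
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Toy example: two modality experts (text, image) with 3-dim embeddings.
text_repr  = [0.2, 0.5, -0.1]
image_repr = [0.4, -0.3, 0.8]
context    = [1.0, 0.0, 0.5]    # e.g. a category embedding for the item
gates      = [[0.3, 0.1, 0.2],  # gate row for the text expert
              [0.1, 0.4, 0.6]]  # gate row for the image expert

w = gate_weights(context, gates)
fused = fuse([text_repr, image_repr], w)
```

In the actual model, such a gate would be trained jointly with the category prediction auxiliary task, so the modality weights adapt per item and per category rather than being fixed heuristically.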

πŸ“ Abstract
The explosion of multimedia data in information-rich environments has intensified the challenges of personalized content discovery, positioning recommendation systems as an essential form of passive data management. Multimodal sequential recommendation, which leverages diverse item information such as text and images, has shown great promise in enriching item representations and deepening the understanding of user interests. However, most existing models rely on heuristic fusion strategies that fail to capture the dynamic and context-sensitive nature of user-modal interactions. In real-world scenarios, user preferences for modalities vary not only across individuals but also within the same user across different items or categories. Moreover, the synergistic effects between modalities, where combined signals trigger user interest in ways isolated modalities cannot, remain largely underexplored. To this end, we propose CAMMSR, a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module, which learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. This component dynamically allocates modality weights guided by an auxiliary category prediction task, enabling adaptive fusion of multimodal signals. Additionally, we design a modality swap contrastive learning task to enhance cross-modal representation alignment through sequence-level augmentation. Extensive experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines, validating its effectiveness in achieving adaptive, synergistic, and user-centric multimodal sequential recommendation.
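The abstract describes the modality swap contrastive task only at a high level. One common way to realize such an objective is an InfoNCE-style loss between a sequence representation and its modality-swapped augmentation; the sketch below (function names, the swap rule, and all values are illustrative assumptions, not the paper's exact loss) shows both pieces:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull the modality-swapped view of the same sequence
    (positive) toward the anchor; push other sequences (negatives) away."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def modality_swap(seq, swap_positions):
    """Sequence-level augmentation: at the chosen positions, exchange an
    item's text and image embeddings. Each item is a (text, image) pair."""
    return [(img, txt) if i in swap_positions else (txt, img)
            for i, (txt, img) in enumerate(seq)]

# Toy example: a 3-item sequence of (text, image) embedding pairs;
# position 1 gets its modalities swapped.
seq = [([1.0, 0.0], [0.0, 1.0]),
       ([0.5, 0.5], [0.5, -0.5]),
       ([0.0, 1.0], [1.0, 0.0])]
augmented = modality_swap(seq, swap_positions={1})

# In the full model, anchor/positive would be encoded sequence
# representations; plain vectors stand in here.
loss = info_nce(anchor=[1.0, 0.0], positive=[0.9, 0.1],
                negatives=[[-1.0, 0.2], [0.0, -1.0]])
```

The loss is small when the anchor and its swapped view are close and the negatives are far, which is the alignment pressure the abstract attributes to this auxiliary task.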
Problem

Research questions and friction points this paper is trying to address.

multimodal sequential recommendation
user-modal interaction
modality fusion
inter-modal synergy
category-guided preference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Sequential Recommendation
Mixture of Experts
Category-Guided Attention
Cross-Modal Synergy
Contrastive Learning