🤖 AI Summary
This work addresses the challenges of multimodal recommendation under sparse user feedback and long-tailed data distributions, where representation entanglement and modality imbalance often degrade performance. To mitigate these issues, the authors propose MAGNET, a novel framework that explicitly assigns modalities to three distinct roles—dominant, balanced, and complementary—via a modality-guided mixture-of-experts architecture over graphs. MAGNET employs an entropy-triggered two-stage routing mechanism to dynamically balance expert coverage and specialization, and integrates a dual-view graph learning module that fuses the interaction graph with content-induced edges. By combining interaction-conditioned routing with structure-aware augmentation, the framework achieves adaptive and interpretable multimodal fusion. Extensive experiments demonstrate that MAGNET significantly outperforms state-of-the-art methods across multiple benchmark datasets, with particularly notable gains in sparse and long-tailed scenarios.
📝 Abstract
Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective fusion both crucial and challenging. Existing approaches often rely on shared fusion pathways, leading to entangled representations and modality imbalance. To address these issues, we propose **MAGNET**, a **M**odality-Guided Mixture of **A**daptive **G**raph Experts **N**etwork with Progressive **E**ntropy-**T**riggered Routing for Multimodal Recommendation, designed to enhance controllability, stability, and interpretability in multimodal fusion. MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both *what* to fuse and *how* to fuse are explicitly controlled and interpretable. At the representation level, a dual-view graph learning module augments the interaction graph with content-induced edges, improving coverage for sparse and long-tail items while preserving collaborative structure via parallel encoding and lightweight fusion. At the fusion level, MAGNET employs structured experts with explicit modality roles (dominant, balanced, and complementary), enabling a more interpretable and adaptive combination of behavioral, visual, and textual cues. To further stabilize sparse routing and prevent expert collapse, we introduce a two-stage entropy-weighting mechanism that monitors routing entropy. This mechanism automatically transitions training from an early coverage-oriented regime to a later specialization-oriented regime, progressively balancing expert utilization and routing confidence. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines.
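The dual-view idea in the abstract — keeping the interaction graph intact while adding content-induced edges for sparse and long-tail items — can be illustrated with a minimal sketch. The paper's actual module uses learned graph encoders and lightweight fusion; here, all function names, the k-nearest-neighbor rule, and the toy feature vectors are hypothetical stand-ins, shown only to make the edge-augmentation step concrete:

```python
def cosine(a, b):
    """Cosine similarity between two content-feature vectors (plain lists)."""
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def content_knn_edges(features, k=1):
    """Hypothetical content-induced edges: link each item to its k most
    similar items by cosine similarity of content features."""
    edges = set()
    for i, fi in enumerate(features):
        sims = sorted(((cosine(fi, fj), j)
                       for j, fj in enumerate(features) if j != i),
                      reverse=True)
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected, deduplicated
    return edges

def dual_view_edges(interaction_edges, features, k=1):
    """Augment the interaction graph with content-induced edges; in the
    paper the two views are encoded in parallel and fused downstream."""
    return set(interaction_edges) | content_knn_edges(features, k)
```

A cold item with few interaction edges still receives content neighbors this way, which is the coverage benefit the abstract claims for sparse and long-tail items.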
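The role-structured expert design (dominant, balanced, complementary) can likewise be sketched as gated mixing of behavioral, visual, and textual embeddings. The fixed mixing weights below and the dictionary-style gate are illustrative assumptions, not the paper's parameterization — in MAGNET the gate is interaction-conditioned and the experts are learned graph modules:

```python
def fuse_with_roles(behavior, visual, textual, gate):
    """Hypothetical role-structured fusion: three experts with fixed
    modality-mixing roles, combined by gate weights that sum to 1.
    Embeddings are plain lists of floats of equal length."""
    def mix(weights):
        # Dimension-wise weighted sum of the three modality embeddings.
        wb, wv, wt = weights
        return [wb * b + wv * v + wt * t
                for b, v, t in zip(behavior, visual, textual)]

    experts = {
        "dominant":      mix((0.8, 0.1, 0.1)),        # behavior-led
        "balanced":      mix((1/3, 1/3, 1/3)),        # equal blend
        "complementary": mix((0.2, 0.4, 0.4)),        # content-led
    }
    dim = len(behavior)
    # gate: expert name -> routing probability from the router.
    return [sum(gate[name] * experts[name][i] for name in experts)
            for i in range(dim)]
```

Because each expert has an explicit modality role, the gate weights themselves are readable: a user routed mostly to the "dominant" expert is being served behavior-led fusion, which is the interpretability argument made in the abstract.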
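Finally, the entropy-triggered two-stage schedule — coverage-oriented early, specialization-oriented once routing becomes confident — can be captured in a few lines. The threshold ratio and the loss-weight values here are made-up placeholders; the paper describes the mechanism qualitatively and this sketch only shows the trigger logic:

```python
import math

def routing_entropy(probs):
    """Shannon entropy of a routing distribution over experts (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def stage_weights(probs, num_experts, threshold_ratio=0.5):
    """Hypothetical two-stage schedule: while routing entropy is high
    relative to the uniform maximum log(K), weight a coverage-oriented
    load-balancing loss; once entropy falls below the threshold, shift
    weight toward a specialization (routing-confidence) loss."""
    max_entropy = math.log(num_experts)
    if routing_entropy(probs) > threshold_ratio * max_entropy:
        return {"coverage": 1.0, "specialization": 0.1}  # early regime
    return {"coverage": 0.1, "specialization": 1.0}      # late regime
```

Making the transition entropy-triggered rather than epoch-scheduled means the switch adapts to how quickly routing actually sharpens, which is how the abstract motivates avoiding expert collapse under sparse routing.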