🤖 AI Summary
Existing multimodal recommendation systems build item-item links for the behavior graph directly from raw modality features, which weakens the synergy between collaborative filtering and modality semantics and leaves the graph vulnerable to modality noise. Moreover, their modality–behavior alignment weights are static and uniform, applying the same strength to every entity at every training step, which hinders effective representation fusion. To address these issues, we propose EGRA, an enhanced multimodal recommendation framework. First, we use a pretrained multimodal recommendation model to produce robust item representations and build an item-item graph from them, capturing both collaborative and modality-aware similarity while suppressing modality noise. Second, we introduce a two-level dynamic alignment mechanism: (i) entity-level adaptive weighting and (ii) a progressive, training-step-aware increase in overall alignment strength. Together these enable fine-grained, time-varying alignment between modality and behavioral representations. Integrating multimodal pretraining, graph neural networks, and end-to-end optimization, our method achieves significant improvements over state-of-the-art approaches across five benchmark datasets, demonstrating its effectiveness and robustness.
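To make the graph-construction idea concrete, here is a minimal sketch of building an item-item graph from item representations. It assumes the representations are already available as plain vectors (standing in for the output of a pretrained model), that similarity is cosine similarity, and that each item keeps its top-k neighbors. The function names `cosine` and `knn_item_graph`, the choice of k, and the toy vectors are all hypothetical; the summary does not specify how the graph is actually built.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_item_graph(item_embs, k):
    """Return {item: [(neighbor, similarity), ...]} with the k most similar items.

    Assumption: top-k cosine neighbors; the real construction rule may differ.
    """
    graph = {}
    for i, emb_i in enumerate(item_embs):
        sims = [(j, cosine(emb_i, emb_j))
                for j, emb_j in enumerate(item_embs) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = sims[:k]
    return graph


# Toy stand-in for pretrained item representations (hypothetical values).
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_item_graph(embs, k=1)
nearest = [graph[i][0][0] for i in range(len(embs))]
```

In this toy input, items 0 and 1 point at each other, as do items 2 and 3. The resulting edges would then be merged into the user-item behavior graph; that merging step is not shown here.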
📝 Abstract
Multimodal recommendation (MMR) systems have emerged as a promising way to improve recommendation quality by leveraging rich item-side modality information, prompting a surge of diverse methods. Despite these advances, existing methods still face two critical limitations. First, they use raw modality features to construct item-item links that enrich the behavior graph, while paying limited attention to balancing collaborative and modality-aware semantics or to mitigating modality noise in the process. Second, they apply a uniform alignment weight to all entities and keep the alignment strength fixed throughout training, which limits the effectiveness of modality-behavior alignment. To address these challenges, we propose EGRA. First, instead of relying on raw modality features, EGRA alleviates sparsity by incorporating into the behavior graph an item-item graph built from representations generated by a pretrained MMR model. This enables the graph to capture both collaborative patterns and modality-aware similarities, with enhanced robustness against modality noise. Second, EGRA introduces a novel bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment: it dynamically assigns alignment strength across entities according to their alignment degree, while gradually increasing the overall alignment intensity throughout training. Extensive experiments on five datasets show that EGRA significantly outperforms recent methods, confirming its effectiveness.
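The bi-level weighting described above can be illustrated with a small sketch. The abstract gives no formulas, so everything specific here is an assumption: the alignment degree is measured as a cosine gap between an entity's modality and behavior representations, less-aligned entities receive larger weights (the opposite direction is equally plausible), and the overall strength follows a linear ramp. The names `global_strength`, `entity_weights`, and `alignment_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def global_strength(step, total_steps, max_strength=1.0):
    """Level 2 (assumed linear ramp): overall alignment strength grows with training."""
    return max_strength * min(1.0, step / total_steps)


def alignment_gaps(modal_embs, behav_embs):
    """Per-entity misalignment, measured here as 1 - cosine similarity."""
    return [1.0 - cosine(m, b) for m, b in zip(modal_embs, behav_embs)]


def entity_weights(gaps):
    """Level 1 (assumed rule): weight each entity by its share of the total gap.

    Weights are scaled to average 1, so a uniform gap gives uniform weights.
    """
    total = sum(gaps)
    if total == 0:
        return [1.0] * len(gaps)
    return [len(gaps) * g / total for g in gaps]


def alignment_loss(modal_embs, behav_embs, step, total_steps):
    """Weighted mean gap, scaled by the training-step-dependent global strength."""
    gaps = alignment_gaps(modal_embs, behav_embs)
    weights = entity_weights(gaps)
    weighted_mean = sum(w * g for w, g in zip(weights, gaps)) / len(gaps)
    return global_strength(step, total_steps) * weighted_mean


# Toy example: entity 0 is perfectly aligned, entity 1 is orthogonal.
modal = [[1.0, 0.0], [1.0, 0.0]]
behav = [[1.0, 0.0], [0.0, 1.0]]
weights = entity_weights(alignment_gaps(modal, behav))
early = alignment_loss(modal, behav, step=10, total_steps=100)
late = alignment_loss(modal, behav, step=90, total_steps=100)
```

With these toy inputs the aligned entity gets weight 0 and the misaligned one gets weight 2, and the same representations incur a larger alignment penalty late in training than early on. In a real model the weights would typically be treated as constants when backpropagating through the loss; whether EGRA does so is not stated in the abstract.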