AI Summary
Existing multimodal recipe recommendation methods treat multimodal features merely as auxiliary inputs to ID-based representations and neglect the deeper semantic relationships among recipes, so the semantic structure within each modality is never modeled explicitly. To address this, the authors propose CLUSSL, a self-supervised framework that combines modality-specific graph construction with a clustering-based learning objective. Specifically, CLUSSL constructs a separate semantic graph for each modality (for example, text and images), applies graph convolution to obtain modality-specific recipe representations, and uses a self-supervised objective to keep the representations from different unimodal graphs independent of one another. Experiments on real-world datasets show that CLUSSL consistently outperforms state-of-the-art methods, suggesting that explicitly modeling semantic structure across modalities improves recommendation accuracy and personalization.
Abstract
Food recommendation systems are pivotal components of digital lifestyle services, designed to help users discover recipes and food items that match their dietary preferences. Typically, multi-modal descriptions offer a comprehensive profile of each recipe, which supports recommendations that are both personalized and accurate. Our preliminary investigation of two datasets indicates that pre-trained multi-modal dense representations can perform worse than ID features at capturing user-recipe interaction relationships. This observation implies that ID features are better suited to modeling collaborative signals from interactions. Consequently, current state-of-the-art methods augment ID features with multi-modal information as supplementary features, overlooking the latent semantic relations between recipes. To address this, we present CLUSSL, a novel food recommendation framework that employs clustering and self-supervised learning. Specifically, CLUSSL constructs a modality-specific graph for each modality, with discrete or continuous features, thereby transforming semantic features into a structural representation. CLUSSL then obtains recipe representations for each modality via graph convolution operations. A self-supervised learning objective is proposed to encourage independence between recipe representations derived from different unimodal graphs. Comprehensive experiments on real-world datasets show that CLUSSL consistently outperforms state-of-the-art recommendation baselines.
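The pipeline described in the abstract (a graph per modality, graph convolution, and an independence objective between unimodal representations) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the graphs are built or how independence is measured, so the cosine k-nearest-neighbor graph, the parameter-free propagation, and the cross-correlation penalty below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np


def knn_graph(features, k):
    """Symmetric kNN adjacency from cosine similarity.

    Assumption: the abstract only says a modality-specific graph is built;
    kNN over pre-trained features is one plausible construction.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    x = features / np.maximum(norms, 1e-12)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # never pick a recipe as its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]
    n = features.shape[0]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def propagate(adj_norm, emb, layers=2):
    """Parameter-free graph convolution: mean of the layer-wise outputs."""
    outputs = [emb]
    for _ in range(layers):
        outputs.append(adj_norm @ outputs[-1])
    return np.mean(outputs, axis=0)


def cross_correlation_penalty(z_a, z_b):
    """Sum of squared cross-correlations between two embedding sets.

    Assumption: a stand-in for the independence objective. It is zero when
    every dimension of z_a is uncorrelated with every dimension of z_b.
    """
    za = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)
    zb = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    c = za.T @ zb / z_a.shape[0]
    return float(np.sum(c ** 2))


# Toy data: 6 recipes with random "text" and "image" features (illustrative).
rng = np.random.default_rng(0)
text_feat = rng.normal(size=(6, 8))
image_feat = rng.normal(size=(6, 8))
id_emb = rng.normal(size=(6, 4))  # stands in for learnable recipe embeddings

z_text = propagate(normalize_adj(knn_graph(text_feat, k=2)), id_emb)
z_image = propagate(normalize_adj(knn_graph(image_feat, k=2)), id_emb)
penalty = cross_correlation_penalty(z_text, z_image)
```

In a trainable version, the penalty would be added to the recommendation loss so that the two unimodal views of a recipe are pushed toward carrying complementary rather than redundant information.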