🤖 AI Summary
To address the underuse of raw multimodal features (images, textual descriptions, and nutritional information) in recipe recommendation, this paper proposes TESMR, a three-stage enhancement framework. First, content-level multimodal semantic representations are extracted using foundation models (e.g., CLIP and BERT). Second, relation-level message propagation is performed over the user-recipe interaction graph. Third, contrastive learning with learnable embeddings refines cross-modal alignment and makes the representations more discriminative. By combining content understanding, structural modeling, and representation learning, TESMR progressively turns raw features into higher-quality embeddings. Experiments on two real-world datasets show that TESMR achieves 7–15% higher Recall@10 than existing methods.
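Because the headline result is reported in Recall@10, a minimal sketch of that metric may help. It uses the common definition (the share of a user's held-out relevant recipes that appear among the top K recommendations, averaged over users). The paper's exact evaluation protocol is not given here, and the function names `recall_at_k` and `mean_recall_at_k` are illustrative, not taken from the paper.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Share of a user's relevant items found in the top-k of a ranked list."""
    if not relevant_items:
        return 0.0
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / len(relevant_items)


def mean_recall_at_k(rankings, ground_truth, k=10):
    """Average Recall@k over users that have at least one relevant item."""
    users = [u for u, rel in ground_truth.items() if rel]
    if not users:
        return 0.0
    total = sum(recall_at_k(rankings[u], ground_truth[u], k) for u in users)
    return total / len(users)


# Hypothetical toy data: recipe IDs ranked per user, and held-out positives.
rankings = {"u1": ["r3", "r7", "r1"], "u2": ["r2", "r9", "r4"]}
ground_truth = {"u1": {"r7", "r8"}, "u2": {"r2"}}
score = mean_recall_at_k(rankings, ground_truth, k=2)
```

In the toy data, `u1` recovers one of two relevant recipes and `u2` recovers its only one, so the averaged score sits between the two per-user values.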
📝 Abstract
Recipe recommendation has become an essential task on web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a three-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7–15% higher Recall@10.
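The three stages can be illustrated with a rough NumPy sketch of the forward computation. This is not the authors' implementation, and everything concrete in it is an assumption: random vectors stand in for the foundation-model features of stage 1, stage 2 uses a LightGCN-style symmetric-normalized propagation, stage 3 uses an InfoNCE loss between learnable recipe embeddings and the propagated ones, and the names `propagate` and `info_nce` are hypothetical. No training loop is shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def propagate(user_emb, item_emb, interactions, num_layers=2):
    """Stage 2 (assumed form): propagate over the user-recipe bipartite graph.

    interactions is a (num_users, num_recipes) 0/1 matrix. Each layer passes
    degree-normalized messages across it; layer outputs are averaged.
    """
    deg_u = np.maximum(interactions.sum(axis=1, keepdims=True), 1.0)
    deg_i = np.maximum(interactions.sum(axis=0, keepdims=True), 1.0)
    norm_adj = interactions / np.sqrt(deg_u) / np.sqrt(deg_i)
    users, items = [user_emb], [item_emb]
    for _ in range(num_layers):
        next_users = norm_adj @ items[-1]
        next_items = norm_adj.T @ users[-1]
        users.append(next_users)
        items.append(next_items)
    return np.mean(users, axis=0), np.mean(items, axis=0)


def info_nce(anchor, positive, temperature=0.2):
    """Stage 3 (assumed form): row i of anchor should match row i of positive."""
    logits = l2_normalize(anchor) @ l2_normalize(positive).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
num_users, num_recipes, dim = 4, 6, 8

# Stage 1 stand-in: in practice these would come from a foundation model.
content_feat = rng.normal(size=(num_recipes, dim))

interactions = (rng.random((num_users, num_recipes)) < 0.5).astype(float)
user_init = rng.normal(scale=0.1, size=(num_users, dim))
learnable_item = rng.normal(scale=0.1, size=(num_recipes, dim))

user_out, item_out = propagate(user_init, content_feat, interactions)
loss = info_nce(learnable_item, item_out)  # would be minimized during training
scores = user_out @ item_out.T             # higher score = stronger recommendation
```

Ranking each row of `scores` in descending order gives a per-user recommendation list that a metric such as Recall@10 can evaluate.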