From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the under-use of raw multimodal features (such as images, textual descriptions, and nutritional information) in recipe recommendation, this paper proposes TESMR, a three-stage enhancement framework. First, content-level multimodal semantic representations are extracted using foundation models (e.g., CLIP and BERT). Second, relation-level enhancement propagates messages over a user-recipe interaction graph. Third, learning-level enhancement applies contrastive learning with learnable embeddings to refine cross-modal alignment and make representations more discriminative. TESMR thus combines content understanding, structural modeling, and representation learning to improve embedding quality. Experiments on two real-world datasets show that TESMR achieves 7–15% higher Recall@10 than existing methods.
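The third stage can be illustrated with a generic contrastive objective. The sketch below is an InfoNCE-style loss, not the loss from the paper: the pairing (one learnable embedding per recipe as the anchor, that recipe's feature-derived embedding as the positive, all other recipes in the batch as negatives), the function name `info_nce`, and the temperature value are assumptions made for illustration.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(anchors, positives, temperature=0.2):
    """InfoNCE-style loss (hypothetical helper, not the paper's objective).

    anchors[i] and positives[i] are assumed to describe the same recipe;
    positives[j] with j != i serve as in-batch negatives.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [_cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching positive.
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    learnable = [[1.0, 0.0], [0.0, 1.0]]
    features = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(learnable, features))
```

The loss is small when each anchor is most similar to its own positive and grows when the pairs are mismatched, which is the property a contrastive stage relies on to pull matching representations together.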

📝 Abstract

Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
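The reported metric, Recall@10, can be sketched as follows. This is the standard definition of Recall@K for top-K recommendation; the function names and the data layout (a ranked list per user and a set of held-out relevant recipes per user) are assumptions, and the paper's exact evaluation protocol is not given here.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of a user's held-out relevant items found in the top-k ranking."""
    if not relevant_items:
        return 0.0
    top_k = set(ranked_items[:k])
    hits = sum(1 for item in relevant_items if item in top_k)
    return hits / len(relevant_items)


def mean_recall_at_k(rankings, ground_truth, k=10):
    """Average Recall@k over users that have at least one relevant item.

    rankings: dict user -> list of recipe ids, best first (assumed layout).
    ground_truth: dict user -> set of held-out recipe ids (assumed layout).
    """
    users = [u for u, items in ground_truth.items() if items]
    if not users:
        return 0.0
    total = sum(recall_at_k(rankings.get(u, []), ground_truth[u], k) for u in users)
    return total / len(users)


if __name__ == "__main__":
    rankings = {"u1": ["r1", "r2", "r3"], "u2": ["r4", "r5"]}
    truth = {"u1": {"r2", "r9"}, "u2": {"r4"}}
    print(mean_recall_at_k(rankings, truth, k=2))
```

Note that the abstract states the gain as "7-15% higher" without saying whether it is relative or absolute, so the sketch makes no assumption about how the improvement itself is computed.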

Problem

Research questions and friction points this paper is trying to address.

Effectively leveraging multimodal features for recipe recommendation systems

Progressively refining raw features into effective embeddings through three stages

Enhancing recommendation performance beyond basic user-recipe interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage framework refining raw multimodal features

Content enhancement using foundation models

Relation enhancement via message propagation
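Relation enhancement via message propagation can be sketched on a toy user-recipe graph. The version below uses LightGCN-style propagation (symmetric degree normalization, no learned weights, layer averaging) as a stand-in; the listing does not say which propagation rule TESMR uses, so the rule, the function name `propagate`, and the dict-based data layout are all assumptions.

```python
import math


def propagate(user_emb, recipe_emb, interactions, num_layers=2):
    """LightGCN-style propagation over user-recipe edges (illustrative only).

    user_emb / recipe_emb: dict id -> list of floats (initial embeddings).
    interactions: list of unique (user_id, recipe_id) pairs.
    Returns layer-averaged (user, recipe) embeddings.
    """
    # Node degrees for the symmetric normalization 1 / sqrt(deg_u * deg_r).
    user_deg, recipe_deg = {}, {}
    for u, r in interactions:
        user_deg[u] = user_deg.get(u, 0) + 1
        recipe_deg[r] = recipe_deg.get(r, 0) + 1

    dim = len(next(iter(user_emb.values())))
    cur_u = {u: list(v) for u, v in user_emb.items()}
    cur_r = {r: list(v) for r, v in recipe_emb.items()}
    # Running sums over layers, starting with layer 0 (the inputs).
    sum_u = {u: list(v) for u, v in user_emb.items()}
    sum_r = {r: list(v) for r, v in recipe_emb.items()}

    for _ in range(num_layers):
        nxt_u = {u: [0.0] * dim for u in cur_u}
        nxt_r = {r: [0.0] * dim for r in cur_r}
        for u, r in interactions:
            w = 1.0 / math.sqrt(user_deg[u] * recipe_deg[r])
            for d in range(dim):
                nxt_u[u][d] += w * cur_r[r][d]  # recipe -> user message
                nxt_r[r][d] += w * cur_u[u][d]  # user -> recipe message
        cur_u, cur_r = nxt_u, nxt_r
        for u in sum_u:
            for d in range(dim):
                sum_u[u][d] += cur_u[u][d]
        for r in sum_r:
            for d in range(dim):
                sum_r[r][d] += cur_r[r][d]

    scale = 1.0 / (num_layers + 1)
    out_u = {u: [x * scale for x in v] for u, v in sum_u.items()}
    out_r = {r: [x * scale for x in v] for r, v in sum_r.items()}
    return out_u, out_r


if __name__ == "__main__":
    users = {"u1": [1.0, 0.0]}
    recipes = {"r1": [0.0, 1.0]}
    print(propagate(users, recipes, [("u1", "r1")], num_layers=1))
```

In this setup the initial recipe embeddings would come from the content-enhancement stage, so propagation mixes multimodal content signals with collaborative signals from the interaction graph.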

🔎 Similar Papers

Multi-modal Food Recommendation using Clustering and Self-supervised Learning